Abstract

Prediction polling is an increasingly popular form of crowdsourcing in which multiple
participants estimate the probability or magnitude of some future event. These estimates
are then aggregated into a single forecast. Historically, randomness in scientific estimation
has been generally assumed to arise from unmeasured factors which are viewed as
measurement noise. However, when combining subjective estimates, heterogeneity stemming
from differences in the participants’ information is often more important than
measurement noise. This paper formalizes information diversity as an alternative source of such
heterogeneity and introduces a novel modeling framework that is particularly well-suited
for prediction polls. A practical specification of this framework is proposed and applied
to the task of aggregating probability and point estimates from two real-world prediction
polls. In both cases our model outperforms standard measurement-error-based
aggregators, hence providing evidence in favor of information diversity being the more important
source of heterogeneity.

1. INTRODUCTION

Past literature has distinguished two types of polling: prediction and opinion polling. In broad
terms, an opinion poll is a survey of public opinion, whereas a prediction poll involves multiple
agents collectively predicting the value of some quantity of interest (Goel et al., 2010; Mellers
et al., 2014). For instance, consider a presidential election poll. An opinion poll typically
asks the voters who they will vote for. A prediction poll, on the other hand, could ask which
candidate they think will win in their state. A liberal voter in a predominantly conservative state
is likely to answer these two questions differently. Even though opinion polls have been the
dominant focus historically, prediction polls have become increasingly popular in
recent years, due to modern social and computer networks that permit the collection of a large number
of responses both from human and machine agents. This has given rise to crowdsourcing
platforms, such as MTurk and Witkey, and many companies, such as Myriada, Lumenogic,
and Inkling, that have managed to successfully capitalize on the benefits of collective wisdom.

This paper introduces statistical methodology designed specifically for the rapidly growing 
practice of prediction polling. The methods are illustrated on real-world data involving two 
common types of responses, namely probability and point forecasts. The probability forecasts 
were collected by the Good Judgment Project (GJP) (Ungar et al., 2012; Mellers et al., 2014) as
a means to estimate the likelihoods of international political future events deemed important by
the Intelligence Advanced Research Projects Activity (IARPA). Since its initiation in 2011, the
project has recruited thousands of forecasters to make probability estimates and update them
whenever they felt the likelihoods had changed. To illustrate, Figure 1 shows the forecasts for
one of these events. This example involves 522 forecasters making a total of 1,669
predictions between 30 July 2012 and 30 December 2012 when the event finally resolved as “No”
(represented by the red line at 0.0). In general, the forecasters reported updates very
infrequently. Furthermore, not all forecasters made probability estimates for all the events, making
the dataset very sparse. The point forecasts for our second application were collected by Moore
and Klein (2008) who recruited 416 undergraduates from Carnegie Mellon University to guess
the weights of 20 people based on a series of pictures. This is an experimental setup where each
participant was required to respond to all the questions, leading to a fully completed dataset.
The responses are illustrated in Figure 2, which shows the boxplots of the forecasters’ guesses for
each of the 20 people. The red dots represent the corresponding true weights.

Figure 1: Probability forecasts of the event “Will Moody’s issue a new downgrade on the
long-term ratings for any of the eight major French banks between 30 July 2012 and 31
December 2012?” The points have been jittered slightly to make overlaps visible.

Figure 2: Point forecasts of the weights of 20 different people. The boxplots have been
sorted to increase in the true weights (red dots). Some extreme values were omitted for
the sake of clarity.

Once the predictions have been collected, they are typically combined into a single
consensus forecast for the sake of decision-making and improved accuracy. Unfortunately, this
can be done in many different ways, and the final combination rule can largely determine the
out-of-sample performance. The past literature distinguishes two broad approaches to forecast
aggregation: empirical aggregation and model-based aggregation. Empirical aggregation is by
far the more widely studied approach; see, e.g., stacking (Breiman, 1996), Bayes model
averaging (Raftery et al., 1997), linear opinion pools (DeGroot and Mortera, 1991), and extremizing
aggregators (Ranjan and Gneiting, 2010; Satopaa et al., 2014a,b). All these methods are akin
to machine learning in the sense that they first learn the aggregator based on a training set of
past forecasts of known outcomes and then use that aggregator to combine future forecasts of
unknown outcomes. Unfortunately, in a prediction polling setup, constructing such a training
set requires a lot of effort and time on behalf of the forecasters and the polling agent.
Therefore a training set is often not available. Instead, the participants are typically handed a single
questionnaire that simultaneously inquires about their predictions of one or more unknown
outcomes. This leads to a dataset consisting only of forecasts, which means that empirical
aggregation cannot be applied.

Fortunately, model-based aggregation can be performed even when prior knowledge of 
outcomes is not available. This approach begins by proposing a plausible probability model 
for the source of heterogeneity among the forecasts, that is, for how and why the forecasts 
differ from the target outcome. Under this assumed forecast-outcome link, it is then possible 
to construct an optimal aggregator that can be applied directly to the forecasts without learning 
the aggregator first from a separate training set. Given this broad applicability, the current paper 
focuses only on the model-based approach. In particular, outcomes are not assumed available 
for aggregation at any point in the paper. Instead, aggregation is performed solely based on 
forecasts, leaving all empirical techniques well outside the scope of the paper. 

Historically, potentially due to early forms of data collection, model-based aggregation 
has considered measurement error as the main source of forecast heterogeneity. This choice 
motivates aggregators with central tendency such as the (weighted) average, median, and so on. 
Intuitively, measurement error may be reasonable in modeling repeated estimates from a single 
instrument. However, it is unlikely to hold in prediction polling, where the estimates arise from 
multiple, often widely different sources. It is also known that a non-trivial weighted average is
not the optimal aggregator (in terms of the expected quadratic and many other loss functions)
under any joint distribution of the outcome and its (conditionally unbiased) forecasts (Dawid
et al., 1995; Ranjan and Gneiting, 2010; Satopaa and Ungar, 2015). This questions the role of
measurement error in model-based aggregation and highlights the need for a different source
of forecast heterogeneity.

The main contribution of this paper is a new source of forecast heterogeneity, called
information diversity, that explains variation by differences in the information available to the
forecasters and how they decide to use it. For instance, forecasters studying the same (or
different) articles about a company may use separate parts of the information and hence report
differing predictions on the company’s future revenue. Such diversity forms the basis of a
novel modeling framework known as the partial information framework. Theory behind this
framework was originally introduced for probability forecasts by Satopaa et al. (2015), though
their specification is somewhat restrictive for empirical applications. The current paper
generalizes the framework beyond probability forecasts and removes all unnecessary assumptions,
leading to a new specification that is more appropriate for practical applications. This
specification allows the decision-maker to build models for different types of forecast-outcome pairs,
such as probability forecasts of binary events or point forecasts of real-valued outcomes. Each
such model motivates and describes an explicit joint distribution for the target outcome and its
forecasts. The optimal aggregator under this joint distribution is available and serves as a more
principled model-based alternative to the usual (weighted) average or median.

The paper is structured as follows. Section 2 first describes the partial information
framework at its most general level and then introduces a practical specification of the framework.
The section ends with a brief review of previous work on model-based aggregation. Section 3
derives a general procedure that guides efficient estimation of the information structure among
the forecasters. Section 4 illustrates on real-world data how specific models within the
framework can be constructed and applied. In particular, the models are derived and evaluated on
probability and point forecasts from the two prediction polls discussed above. Overall, the
resulting partial information aggregators achieve a noticeable performance improvement over
the common measurement-error-based aggregators, suggesting that information diversity is the
more appropriate model of forecast heterogeneity. Finally, Section 5 concludes with a summary
and discussion of future research.


2. MODEL-BASED AGGREGATION 


2.1 Bias and Noise 


Consider $N$ forecasters and suppose forecaster $j$ predicts $X_j$ for some quantity of interest $Y$.
For instance, in our weight estimation example $Y$ is the true weight of a person and $X_j$ is the
guess given by the $j$th undergraduate. In our probability forecasting application, on the other
hand, $Y$ is binary, reflecting whether the event happens or not, and $X_j \in [0, 1]$ is a probability
forecast for its occurrence. This section, however, avoids such application-specific choices
and treats $Y$ and $X_j$ as generic random variables. In general, prediction $X_j$ is nothing but
an estimator of $Y$. Therefore, as is the case with all estimators, its deviation from the truth
can be broken down into two components: bias and noise. On the theoretical level, these two
components can be separated and hence are often addressed by different mechanisms. This
suggests a two-step approach to forecast aggregation: i) eliminate any bias in the forecasts, and
ii) combine the unbiased forecasts.

Historically, bias in human judgment has been extensively studied in the psychology
literature (for reviews, see Lichtenstein et al., 1977; Yates, 1990; Keren, 1991). This bias often exhibits
well-known patterns (see, e.g., the easy-hard effect in Lichtenstein and Fischhoff, 1977; Juslin,
1993), and many authors have proposed both cognitive and motivational models to explain it
(Koriat et al., 1980; Kruglanski, 1990; Soll, 1996; Moore and Healy, 2008). These models and
other results in this popular area of research suggest ways for ex-ante bias reduction. Such
techniques, however, are not in the scope of this paper. Instead, the focus here is on noise
reduction and hence specifically on developing methodology for the second step in the overall
process of forecast aggregation. In particular, Section 2.2 describes our new framework for
modeling the noise component. This is then compared in Section 2.3 to previous noise models.
These models make different assumptions about the way the unbiased forecasts relate to the
target outcome and hence motivate very different classes of model-based aggregators.


2.2 Partial Information Framework 

2.2.1 General Framework

The partial information framework assumes that $Y$ and $X_j$ are measurable under some common
probability space $(\Omega, \mathcal{F}, P)$. The probability measure $P$ provides a non-informative yet proper
prior on $Y$ and reflects the basic information known to all forecasters. Such a prior has been
discussed extensively in the economics and game theory literature where it is usually known as
the common prior. Even though this is a substantive assumption in the framework, specifying
a prior distribution cannot be avoided as long as the model depends on a probability space.
This includes essentially any probability model for forecast aggregation. How the prior is
incorporated depends on the problem context: it can be chosen explicitly by the
decision-maker, computed based on past observations of $Y$, or estimated directly from the forecasts.

The principal $\sigma$-field $\mathcal{F}$ can be interpreted as all the possible information that can be known
about $Y$. On top of the basic information reflected in the prior, the $j$th forecaster uses some
personal partial information set $\mathcal{F}_j \subseteq \mathcal{F}$ and predicts $X_j = E(Y \mid \mathcal{F}_j)$. Therefore
$\mathcal{F}_i \neq \mathcal{F}_j$ if $X_i \neq X_j$, and forecast heterogeneity stems purely from information diversity. Note, however,
that if forecaster $j$ uses a simple rule, $\mathcal{F}_j$ may not be the full $\sigma$-field of information available
to the forecaster but rather a smaller $\sigma$-field corresponding to the information used by the
rule. Furthermore, if two forecasters have access to the same $\sigma$-field, they may decide to use
different sub-$\sigma$-fields, leading to different predictions. This is particularly salient in our weight
estimation example where each forecaster has access to the exact same information, namely
the picture of the person, but can choose to use different subsets of this information. Therefore,
information diversity does not only arise from differences in the available information, but also
from how the forecasters decide to use it. This general point of view was motivated in Satopaa
et al. (2015) with simple examples that illustrate how the optimal aggregate is not well-defined
without assumptions on the information structure among the forecasters.


Satopaa et al. (2015) also show that $X_j = E(Y \mid \mathcal{F}_j)$ is precisely the same as having a
calibrated (sometimes also known as reliable) forecast, that is, $X_j = E(Y \mid X_j)$. Therefore the
form $X_j = E(Y \mid \mathcal{F}_j)$ arises directly from the existence of an underlying probability model and
calibration. Overall, calibration $X_j = E(Y \mid X_j)$ has been widely discussed in the statistical
and meteorological forecasting literature (see, e.g., Dawid et al., 1995; Ranjan and Gneiting,
2010; Broecker, 2012), with traces at least as far back as Murphy and Winkler (1987). Given
that the condition $X_j = E(Y \mid X_j)$ depends on the probability measure $P$, it should be referred
to as $P$-calibration when the choice of the probability measure needs to be emphasized. This
dependency shows the main conceptual difference between $P$-calibration and the notion of
empirical calibration (Dawid, 1982; Foster and Vohra, 1998; and many others). However, as was
pointed out by Dawid et al. (1995), these two notions can be expressed in formally identical
terms by letting $P$ represent the limiting joint distribution of the forecast-outcome pairs.

In practice researchers have discovered many calibrated subpopulations of experts, such
as meteorologists (Murphy and Winkler, 1977a,b), experienced tournament bridge players
(Keren, 1987), and bookmakers (Dowie, 1976). Generally, calibration can be improved through
team collaboration, training, tracking (Mellers et al., 2014), performance feedback (Murphy
and Daan, 1984), representative sampling of target events (Gigerenzer et al., 1991; Juslin,
1993), or by evaluating the forecasters’ performance under a loss function that is minimized
by the conditional expectation of $Y$, given the forecaster’s information (Banerjee et al., 2005).
If one is nonetheless left with uncalibrated forecasts, they can be calibrated ex-ante as follows.
First, consider some (possibly uncalibrated) forecasts $X = (X_1, \dots, X_N)'$ defined on $(\Omega, \mathcal{F})$.
Choose some distribution $Q$ for $(Y, X)$. For instance, Dawid et al. (1995) suggest first
choosing a distribution $Q$ for $X$ and then setting $Q(Y, X) = \Upsilon(X) Q(X)$, where $\Upsilon$ is an arbitrary
aggregator (such as the average of probability forecasts of a binary event) acting as $Q(Y \mid X)$.
Alternatively, one may search for an appropriate $Q$ in the large literature of quantitative
psychology. Regardless of how $Q$ is constructed, however, the calibrated version of $X_j$ is $E_Q(Y \mid X_j)$.
This forecast is $Q$-calibrated and can be written as $E_Q(Y \mid \mathcal{F}_j)$, where $\mathcal{F}_j = \sigma(E_Q(Y \mid X_j))$ is the
$\sigma$-field generated by $E_Q(Y \mid X_j)$. Intuitively, calibrating is equivalent to replacing forecast $x$ by
$E_Q(Y \mid X_j = x)$ for all possible values $x \in \mathrm{supp}(X_j)$. Perhaps, however, one does not want to
work under this particular model. To accommodate alternative models (such as the Gaussian
model described in Section 2.2.2), the next proposition shows how $Q$-calibrated forecasts can
be transformed into forecasts that are calibrated under some other probability measure $P$. All
the proofs are deferred to Appendix A.

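Proposition 2.1. Consider a probability measure $P$ such that $P \ll Q$. Let $\frac{dP}{dQ}$ denote the
Radon-Nikodym derivative of $P$ with respect to $Q$. The forecasts under the new model $P$ are then given
by the transformation

\[
E_P(Y \mid \mathcal{F}_j) = \frac{E_Q\left(Y \frac{dP}{dQ} \,\middle|\, \mathcal{F}_j\right)}{E_Q\left(\frac{dP}{dQ} \,\middle|\, \mathcal{F}_j\right)},
\]

where $\mathcal{F}_j = \sigma(E_Q(Y \mid X_j))$.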

This shows that uncalibrated forecasts from “non-experts” can be calibrated as long as 
one agrees on some joint distribution for the target outcome and its forecasts. While such 
constructs certainly deserve further analysis, they are not in the scope of this paper and hence 
are left for future work. Therefore, from now on, the forecasts are assumed to be calibrated. 
Note, however, that in general the forecasts should satisfy some minimal performance criterion;
simply aggregating entirely arbitrary forecasts is hardly going to lead to improved forecasting
accuracy. To this end, Foster and Vohra (1998) analyze probability forecasts and state that
“calibration does seem to be an appealing minimal property that any probability forecast should
satisfy.” They show that one needs to know almost nothing about the outcomes in order to be
calibrated. Thus, in theory, calibration can be achieved very easily and overall seems like an
appropriate base assumption for developing a general theory of forecast aggregation.
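
The calibration step and the change of measure in Proposition 2.1 can be illustrated on a finite sample space. The following Python sketch is purely illustrative: the joint distribution `Q`, the derivative `L` (standing in for dP/dQ), and the helper names are assumptions made for this example and are not taken from the data or the proofs of this paper. The sketch first replaces each reported value x by E_Q(Y | X = x) and then checks that the ratio form of Proposition 2.1 agrees with the conditional expectation computed directly under P.

```python
# Hypothetical joint distribution Q over (outcome y, reported forecast x).
Q = {
    (1, 0.9): 0.30, (0, 0.9): 0.20,
    (1, 0.2): 0.10, (0, 0.2): 0.40,
}

def cond_exp(weights, x, f=lambda y: y):
    """E(f(Y) | X = x) under weights on (y, x); weights need not sum to one."""
    num = sum(w * f(y) for (y, xx), w in weights.items() if xx == x)
    den = sum(w for (_, xx), w in weights.items() if xx == x)
    return num / den

# Step 1: Q-calibration replaces each reported value x by E_Q(Y | X = x).
reported = sorted({x for _, x in Q})
calibrated = {x: cond_exp(Q, x) for x in reported}

# Step 2: a second measure P << Q, specified through an assumed
# Radon-Nikodym derivative L = dP/dQ that depends on y only (E_Q[L] = 1).
L = {1: 1.75, 0: 0.5}
P = {(y, x): q * L[y] for (y, x), q in Q.items()}

def forecast_under_P(x):
    """Ratio form: E_Q(Y L | X = x) / E_Q(L | X = x)."""
    numerator = cond_exp(Q, x, lambda y: y * L[y])
    denominator = cond_exp(Q, x, lambda y: L[y])
    return numerator / denominator

for x in reported:
    direct = cond_exp(P, x)          # computed from P itself
    via_ratio = forecast_under_P(x)  # computed from Q and dP/dQ only
    print(x, calibrated[x], direct, via_ratio)
```

In this toy example the reported value 0.2 is already Q-calibrated, whereas the overconfident report 0.9 is replaced by 0.6. Because the two calibrated values are distinct, conditioning on the reported value is the same as conditioning on the calibrated forecast.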

Given that the partial information framework generates all forecast variation from
information diversity, it is important to understand the extent to which the forecasters’ partial
information sets can be measured in practice. First, note that, for the purposes of aggregation,
any available information discarded by a forecaster may as well not exist because information
comes to the aggregator only through the forecasts. Therefore it is not in any way restrictive to
assume that $\mathcal{F}_j = \sigma(X_j)$. Second, the following proposition describes observable measures for
the amount of information in each forecast and for the amount of information overlap between
any two forecasts.

Proposition 2.2. If $\mathcal{F}_j = \sigma(X_j)$ such that $E(Y \mid \mathcal{F}_j) = E(Y \mid X_j) = X_j$ for all $j = 1, \dots, N$,
then the following holds.

i) Forecasts are marginally consistent: $E(Y) = E(X_j)$.

ii) Variance increases in information: $\mathrm{Var}(X_i) \le \mathrm{Var}(X_j)$ if $\mathcal{F}_i \subseteq \mathcal{F}_j$. Given that $Y =
E(Y \mid \mathcal{F})$, the variances of the forecasts are upper bounded as $\mathrm{Var}(X_j) \le \mathrm{Var}(Y)$ for all
$j = 1, \dots, N$.

iii) $\mathrm{Cov}(X_j, X_i) = \mathrm{Var}(X_i)$ if $\mathcal{F}_i \subseteq \mathcal{F}_j$. Again, expressing $Y = E(Y \mid \mathcal{F})$ implies that
$\mathrm{Cov}(X_j, Y) = \mathrm{Var}(X_j)$ for all $j = 1, \dots, N$.
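
The three items of Proposition 2.2 can be verified by exact enumeration on a small example. The Python sketch below is a hypothetical construction, not part of the paper's analysis: the outcome Y is assumed to be the sum of three independent fair coins taking values -1 or +1, forecaster 1 is assumed to see the first coin, forecaster 2 the first two coins (so the information sets are nested), and forecaster 3 only the third coin. Each forecast is computed as a conditional expectation by averaging Y over the outcomes that match the observed coins.

```python
from itertools import product

# All eight equally likely outcomes of three +/-1 coins (hypothetical setup).
outcomes = list(product([-1, 1], repeat=3))
Y = {w: sum(w) for w in outcomes}

def forecast(seen):
    """X = E(Y | coins with indices in `seen`), by averaging over matches."""
    x = {}
    for w in outcomes:
        match = [v for v in outcomes if all(v[i] == w[i] for i in seen)]
        x[w] = sum(Y[v] for v in match) / len(match)
    return x

def mean(a):
    return sum(a[w] for w in outcomes) / len(outcomes)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((a[w] - ma) * (b[w] - mb) for w in outcomes) / len(outcomes)

X1 = forecast([0])      # sees coin 1 only
X2 = forecast([0, 1])   # sees coins 1 and 2, so F_1 is contained in F_2
X3 = forecast([2])      # sees coin 3 only, no overlap with forecaster 1

print("means:", mean(Y), mean(X1), mean(X2))               # item i)
print("variances:", cov(X1, X1), cov(X2, X2), cov(Y, Y))   # item ii)
print("Cov(X2, X1):", cov(X2, X1), "Cov(X2, Y):", cov(X2, Y))  # item iii)
print("Cov(X1, X3):", cov(X1, X3))  # zero information overlap
```

Here the variance of each forecast counts the number of coins the forecaster uses, and the covariance between two forecasts counts the coins they share.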

This proposition is important for multiple reasons. First, item i) provides guidance in
estimating the prior mean of $Y$ from the observed forecasts. Second, item ii) shows that $\mathrm{Var}(X_j)$
quantifies the amount of information used by forecaster $j$. In particular, $\mathrm{Var}(X_j)$ increases to
$\mathrm{Var}(Y)$ as forecaster $j$ learns and becomes more informed. Therefore increased variance
reflects more information and is deemed helpful. This is a clear contrast to the standard statistical
models that often regard higher variance as increased noise and hence harmful. The covariance
$\mathrm{Cov}(X_i, X_j)$, on the other hand, can be interpreted as the amount of information overlap
between forecasters $i$ and $j$. Given that being non-negatively correlated is not generally transitive
(Langford et al., 2001), these covariances are not necessarily non-negative even though all
forecasts are non-negatively correlated with the outcome. Such negatively correlated forecasts can
arise in a real-world setting. For instance, consider two forecasters who see voting preferences
of two different sub-populations that are politically opposed to each other. Each individually is
a weak predictor of the total vote on any given issue, but they are negatively correlated because
of the likelihood that these two blocks will largely oppose each other.


Third and finally, item iii) shows that the covariance matrix $\Sigma_X$ of the $X_j$s extends to the
unknown $Y$ as follows:

\[
\mathrm{Cov}\left((Y, X_1, \dots, X_N)'\right) =
\begin{pmatrix}
\mathrm{Var}(Y) & \mathrm{diag}(\Sigma_X)' \\
\mathrm{diag}(\Sigma_X) & \Sigma_X
\end{pmatrix},
\tag{1}
\]

where $\mathrm{diag}(\Sigma_X)$ denotes the diagonal of $\Sigma_X$. This is the key to regressing $Y$ on the $X_j$s
without a separate training set of past forecasts of known outcomes. The resulting estimator,
called the revealed aggregator, is

\[
X'' := E(Y \mid X_1, \dots, X_N) = E(Y \mid \mathcal{F}''),
\]

where $\mathcal{F}'' := \sigma(X_1, \dots, X_N)$ is the $\sigma$-field generated (or information revealed) by the $X_j$s. The
revealed aggregator uses all the information that is available in the forecasts and hence is the
optimal aggregator under the distribution of $(Y, X_1, \dots, X_N)$. To make this precise, consider a
scoring rule $S(x, y)$ that represents the loss of predicting $x$ when the outcome is $y$. A scoring
rule is said to be consistent for the mean of $Y$ if $E_Y[S(E_Y(Y), Y)] \le E_Y[S(x, Y)]$ for all $x \in \mathbb{R}$.

Savage (1971) showed, subject to weak regularity conditions, that all such scoring rules can be
written in the form

\[
S(x, y) = \phi(y) - \phi(x) - \phi'(x)(y - x),
\tag{2}
\]

where $\phi$ is a convex function with subgradient $\phi'$. An important special case is the quadratic
loss $S(x, y) = (x - y)^2$ that arises when $\phi(x) = x^2$. Now, if an aggregator is defined as any
random variable $\mathcal{X} \in \sigma(X_1, \dots, X_N)$, then $X''$ is an aggregator that minimizes expectation of
any scoring rule $S$ of the form (2):

\[
\begin{aligned}
\min_{\mathcal{X}} E[S(\mathcal{X}, Y)] &= E_{X_1, \dots, X_N}\left\{ \min_{\mathcal{X}} E_{Y \mid X_1, \dots, X_N}[S(\mathcal{X}, Y)] \right\} \\
&= E[S(X'', Y)].
\end{aligned}
\]


Ranjan and Gneiting (2010) showed a similar result for probability forecasts. For these
reasons, $X''$ is considered the relevant aggregator under each specific instance of the framework.
The next section shows how this aggregator can be captured in practice.
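
The Savage representation (2) can be explored numerically. The Python sketch below is an illustration under stated assumptions: the discrete distribution `dist` of Y, the grid of candidate predictions, and the second convex function $\phi(x) = x^4$ are hypothetical choices that do not appear in the paper. It shows that $\phi(x) = x^2$ recovers the quadratic loss and that, for both choices of $\phi$, the expected score over the grid is smallest at the mean of Y.

```python
def savage_score(phi, dphi, x, y):
    """S(x, y) = phi(y) - phi(x) - phi'(x) (y - x) for a convex phi."""
    return phi(y) - phi(x) - dphi(x) * (y - x)

def quadratic(x, y):
    # phi(x) = x^2 recovers the quadratic loss (x - y)^2.
    return savage_score(lambda t: t * t, lambda t: 2 * t, x, y)

def quartic(x, y):
    # Another convex choice, phi(x) = x^4, used only as an illustration.
    return savage_score(lambda t: t ** 4, lambda t: 4 * t ** 3, x, y)

# Hypothetical discrete distribution of Y: value -> probability.
dist = {0.0: 0.5, 1.0: 0.3, 4.0: 0.2}
mean_y = sum(y * p for y, p in dist.items())

def expected_score(score, x):
    """E[S(x, Y)] under the discrete distribution above."""
    return sum(p * score(x, y) for y, p in dist.items())

grid = [i / 10 for i in range(41)]  # candidate predictions 0.0, 0.1, ..., 4.0
best_quadratic = min(grid, key=lambda x: expected_score(quadratic, x))
best_quartic = min(grid, key=lambda x: expected_score(quartic, x))

print("mean of Y:", mean_y)
print("grid minimizers:", best_quadratic, best_quartic)
```

The same argument applied conditionally on the forecasts is what makes the conditional mean $X''$ the minimizer among all aggregators.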


2.2.2 Gaussian Partial Information Model 


Even though the general framework is convenient for theoretical analysis, it is clearly too
abstract for practical applications. Fortunately, applying the framework in practice only
requires one extra assumption, namely the choice of a parametric family for the distribution
of $(Y, X_1, \dots, X_N)$. One approach is to refer to Proposition 2.2 and choose a family that is
parametrized in terms of the first two joint moments. This points at the multivariate
Gaussian distribution that is a typical starting point in developing statistical methodology and often
provides the cleanest entry into the issues at hand.

The Gaussian distribution is also the most common choice for modeling measurement
error. This is typically motivated by assuming the terms to represent sums of a large number of
independent sources of error. The central limit theorem then gives a natural motivation for the
Gaussian distribution. A similar argument can be made under the partial information
framework. First, consider some pieces of information. Each piece either has a positive or negative
impact and hence respectively either increases or decreases $Y$. The total sum (integral) of these
pieces determines the value of $Y$. Each forecaster, however, only observes the sum of some
subset of them. Based on this sum, the forecaster makes an estimate of $Y$. If the pieces are
independent and have small tails, then the joint distribution of the forecasters’ observations
will be asymptotically Gaussian. Given that the number of information pieces in a real-world
setup is likely to be large, it makes sense to model the forecasters’ observations as jointly
Gaussian. Of course, other distributions, such as the multivariate $t$-distribution, are possible. At this
point, however, such alternative specifications are best left for future work.

The model variables $(Y, X_1, \dots, X_N)$ can be modeled directly with a Gaussian distribution
as long as they are all real-valued. In many applications, however, $Y$ and $X_j$ may not be
supported on the whole real line. For instance, the aforementioned Good Judgment Project
collected probability forecasts of binary events. In this case, $X_j \in [0, 1]$ and $Y \in \{0, 1\}$.
Fortunately, different types of outcome-forecast pairs can be easily addressed by borrowing
from the theory of generalized linear models (McCullagh and Nelder, 1989) and utilizing a
link function. The result is a close yet widely applicable specification called the Gaussian
partial information model. This model begins by introducing $N + 1$ information variables that
follow a multivariate Gaussian distribution with the covariance pattern (1):

\[
\begin{pmatrix} Z_0 \\ Z_1 \\ \vdots \\ Z_N \end{pmatrix}
\sim \mathcal{N}_{N+1}\left( 0,\;
\begin{pmatrix} 1 & \mathrm{diag}(\Sigma)' \\ \mathrm{diag}(\Sigma) & \Sigma \end{pmatrix}
=
\begin{pmatrix}
1 & \delta_1 & \delta_2 & \cdots & \delta_N \\
\delta_1 & \delta_1 & \rho_{1,2} & \cdots & \rho_{1,N} \\
\delta_2 & \rho_{2,1} & \delta_2 & \cdots & \rho_{2,N} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\delta_N & \rho_{N,1} & \rho_{N,2} & \cdots & \delta_N
\end{pmatrix}
\right).
\tag{3}
\]


This distribution supports the Gaussian model similarly to the way the ordinary linear
regression supports the class of generalized linear models. In particular, the information variables
transform into the outcome and forecasts via an application-specific link function $g(\cdot)$; that
is, $Y = g(Z_0)$ and $X_j = E(Y \mid Z_j) = E(g(Z_0) \mid Z_j)$. Given that $Z_0$ fully determines $Y$, it is
sufficient for all information that can be known about $Y$. The remaining variables $Z_1, \dots, Z_N$,
on the other hand, summarize the forecasters’ partial information. To make this more concrete,
consider our two real-world applications. For probability forecasts of a binary event a
reasonable link function $g(\cdot)$ is the indicator function $1_A$, where $A = \{Z_0 > t\}$ for some threshold
value $t \in \mathbb{R}$. For real-valued $X_j$ and $Y$, on the other hand, a reasonable choice is the reverse
standardizing function $g(Z_0) = \sigma_0 Z_0 + \mu_0$, where $\mu_0$ and $\sigma_0$ are the prior mean and standard
deviation of $Y$, respectively. In general, it makes sense to have $g(\cdot)$ map from the real numbers
to the support of $Y$ such that $Y$ has the correct prior $P(Y)$.
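
For the indicator link, the individual forecasts have a closed form that follows from Gaussian conditioning. Under the covariance pattern (3), $\mathrm{Cov}(Z_0, Z_j) = \mathrm{Var}(Z_j) = \delta_j$, so that $Z_0 \mid Z_j \sim \mathcal{N}(Z_j, 1 - \delta_j)$ and hence $X_j = P(Z_0 > t \mid Z_j) = \Phi((Z_j - t)/\sqrt{1 - \delta_j})$, where $\Phi$ is the standard Gaussian distribution function. The Python sketch below simply evaluates this expression and its inverse. The function names and the numerical values are hypothetical, and $0 < \delta_j < 1$ is assumed.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard Gaussian, mean 0 and variance 1

def prior_probability(t):
    """P(Y = 1) = P(Z_0 > t) when Y is the indicator of {Z_0 > t}."""
    return 1.0 - STD_NORMAL.cdf(t)

def probability_forecast(z_j, delta_j, t):
    """X_j = P(Z_0 > t | Z_j = z_j), using Z_0 | Z_j ~ N(z_j, 1 - delta_j)."""
    return STD_NORMAL.cdf((z_j - t) / sqrt(1.0 - delta_j))

def information_from_forecast(x_j, delta_j, t):
    """Inverse map: recover z_j from a forecast x_j strictly inside (0, 1)."""
    return t + sqrt(1.0 - delta_j) * STD_NORMAL.inv_cdf(x_j)

# Hypothetical forecaster with delta_j = 0.75 who observes z_j = 1.0, t = 0.
x = probability_forecast(1.0, 0.75, 0.0)
print("prior:", prior_probability(0.0), "forecast:", x)
print("recovered z_j:", information_from_forecast(x, 0.75, 0.0))
```

A forecaster with more information (larger $\delta_j$) has a smaller conditional variance $1 - \delta_j$ and therefore reports probabilities closer to 0 or 1 for the same observed $z_j$.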

Overall, this model can be considered as a close yet practical specification of the
general framework. After all, it only adds on the assumption of Gaussianity. This extra
assumption, however, is enough to allow the construction of the revealed aggregator $X'' =
E(Y \mid Z_1, \dots, Z_N)$. For $X''$ and also $X_j$ the conditional expectations can be often computed
via the following conditional distributions:

\[
\begin{aligned}
Z_0 \mid Z_j &\sim \mathcal{N}\left(Z_j,\; 1 - \delta_j\right) \quad \text{and} \\
Z_0 \mid Z &\sim \mathcal{N}\left(\mathrm{diag}(\Sigma)' \Sigma^{-1} Z,\; 1 - \mathrm{diag}(\Sigma)' \Sigma^{-1} \mathrm{diag}(\Sigma)\right),
\end{aligned}
\]

where $Z = (Z_1, \dots, Z_N)'$. For instance, if both $X_j$ and $Y$ are real-valued, then $X_j = \sigma_0 Z_j + \mu_0$
and $X'' = \mathrm{diag}(\Sigma)' \Sigma^{-1} (X - \mu_0 1_N) + \mu_0$, where $X = (X_1, \dots, X_N)'$. These conditional
distributions arise directly from the well-known conditional distributions of the multivariate
Gaussian distribution (see, e.g., Ravishanker and Dey, 2001).
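
The real-valued revealed aggregator $X'' = \mathrm{diag}(\Sigma)' \Sigma^{-1} (X - \mu_0 1_N) + \mu_0$ can be sketched in a few lines of Python. The example below is hypothetical: it assumes that the information structure $\Sigma$ and the prior mean $\mu_0$ are already known (their estimation is a separate matter), the numerical values are invented, and the small linear solver is included only to keep the sketch free of external libraries.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def revealed_weights(Sigma):
    """Solve Sigma w = diag(Sigma), so that w' = diag(Sigma)' Sigma^{-1}."""
    d = [Sigma[i][i] for i in range(len(Sigma))]
    return solve(Sigma, d)

def revealed_aggregator(forecasts, Sigma, mu0):
    """X'' = mu0 + diag(Sigma)' Sigma^{-1} (X - mu0 1_N)."""
    w = revealed_weights(Sigma)
    return mu0 + sum(wi * (x - mu0) for wi, x in zip(w, forecasts))

# Two forecasters with partially overlapping information (assumed values).
Sigma_partial = [[0.4, 0.2],
                 [0.2, 0.6]]
print(revealed_weights(Sigma_partial))
print(revealed_aggregator([160.0, 170.0], Sigma_partial, mu0=150.0))

# Nested information: forecaster 2 knows everything forecaster 1 knows.
Sigma_nested = [[0.3, 0.3],
                [0.3, 0.6]]
print(revealed_weights(Sigma_nested))
```

In the first case the weights are approximately 0.6 and 0.8. They sum to more than one, so the aggregate (about 172) lies further from the prior mean than either forecast. In the nested case the weights are approximately 0 and 1, so the aggregate ignores the less informed forecaster.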

2.3 Previous Work on Model-Based Aggregation


2.3.1 Interpreted Signal Framework 


The interpreted signal framework is a behavioral model that assumes different predictions to
arise from differing interpretation procedures (Hong and Page, 2009). For example, consider
two forecasters who visit a company and predict its future revenue. One forecaster may
carefully examine the company’s technological status while the other pays closer attention to what
the managers say. Even though the forecasters receive and possibly even use the exact same
information, they may interpret it differently and hence end up reporting different forecasts.
Therefore forecast heterogeneity is assumed to stem from “cognitive diversity”.

This is a very reasonable model and hence has been used in various forms to simulate and
illustrate theory about expert behavior (see, e.g., Broomell and Budescu, 2009; Parunak et al.,
2013). Consequently, previous authors have constructed many highly specialized toy models
of interpreted forecasts. For instance, Dawid et al. (1995) construct simple models of two
forecasts to support their discussion on coherent forecast aggregation; Ranjan and Gneiting
(2010) use one of these models to simulate calibrated forecasts; and Di Bacco et al. (2003)
introduce a model for two forecasters whose (interpreted) log-odds predictions follow a joint
Gaussian distribution. Unfortunately, their model is very narrow due to its detailed assumptions
and extensive computations. Furthermore, it is not clear how the model can be used in practice
or extended to $N$ forecasters. All in all, it seems that successful previous applications of the
interpreted signal framework have used it as a basis for illustrating theory instead of actually
aiming to model real-world forecasts. In this respect, the framework has remained relatively
abstract.

Our partial information framework, however, formalizes the intuition behind the interpreted
signal framework, allows quantitative predictions, and provides a flexible construction for modeling
many different forecasting setups. Overall, the framework is very general and, in fact, encompasses all the other
authors’ models mentioned above as different sub-cases. Unlike the Gaussian model,
however, these models make many restrictive assumptions in addition to just choosing a parametric
family. Even though the general partial information framework, as described in Section 2.2,
does not allow the forecasters to interpret information differently and hence does not capture
all aspects of the interpreted signal framework, personal interpretations can be easily
introduced by associating forecaster $j$ with a probability measure $P_j$ that describes that forecaster’s
interpretation of information. If $E_j$ denotes the expectation under $P_j$, then it is possible that
$X_i = E_i(Y \mid \mathcal{F}_i) \neq X_j = E_j(Y \mid \mathcal{F}_j)$ even if $\mathcal{F}_i = \mathcal{F}_j$. In practice, however, eliciting the details
of each $P_j$ is hardly possible. Therefore, to keep the model tractable, it is convenient to assume
a common interpretation $P_j = P$ for all $j$.


2.3.2 Measurement Error Framework 

In the absence of a quantitative interpreted signal model, prior applications have typically
explained forecast heterogeneity with standard statistical models. These models are different
formalizations of the measurement error framework that generates forecast heterogeneity purely
from a probability distribution. More specifically, this framework assumes a "true" (possibly
transformed) forecast $\theta$, which can be interpreted as the prediction made by an ideal forecaster.
The forecasters then somehow measure $\theta$ with mean-zero idiosyncratic error. For instance, in
our probability forecasting application one possible measurement error model is


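$$
\begin{aligned}
Y &\sim \text{Bernoulli}(\theta), \\
\text{logit}(X_j) &= \text{logit}(\theta) + \epsilon_j, \text{ and} \\
\epsilon_j &\overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2) \text{ for all } j = 1, \dots, N,
\end{aligned}
\tag{4}
$$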

where $\text{logit}(x) = \log(x/(1 - x))$ is the log-odds operator. Given that the errors are generally
assumed to have mean zero, measurement error forecasts are unbiased estimates of $\theta$, that is,
$\mathbb{E}(X_j \mid \theta) = \theta$. Observe that this is not the same as assuming calibration $\mathbb{E}(Y \mid X_j) = X_j$.
Therefore an unbiased estimation model is very different from a calibrated model. This distinction is
further emphasized by the fact that $X''$ never reduces to a (non-trivial) weighted average of the
forecasts (Satopää and Ungar, 2015). Given that the measurement-error aggregators are often
different types of weighted averages, measurement error and information diversity are not only
philosophically different but they also require very different aggregators.
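
As a purely illustrative sketch (not part of the original analysis), model (4) can be simulated in a few lines of Python. The function names `simulate_measurement_error` and `average_log_odds`, the seed, and the parameter values are assumptions chosen for the example. The sketch shows that averaging in the log-odds space recovers $\theta$ rather than the realized outcome $Y$.

```python
import numpy as np

def simulate_measurement_error(theta, sigma, n_forecasters, rng):
    """Draw one outcome Y and N forecasts X_j from model (4)."""
    eps = rng.normal(0.0, sigma, size=n_forecasters)   # mean-zero errors
    logit_x = np.log(theta / (1.0 - theta)) + eps      # logit(X_j)
    x = 1.0 / (1.0 + np.exp(-logit_x))                 # back to probabilities
    y = int(rng.binomial(1, theta))                    # Y ~ Bernoulli(theta)
    return y, x

def average_log_odds(x):
    """Equally weighted average in log-odds space, mapped back to [0, 1]."""
    mean_logit = np.mean(np.log(x / (1.0 - x)))
    return float(1.0 / (1.0 + np.exp(-mean_logit)))

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
y, x = simulate_measurement_error(theta=0.3, sigma=1.0,
                                  n_forecasters=100_000, rng=rng)
estimate = average_log_odds(x)  # approaches theta as N grows, whatever y is
```

With many forecasters the aggregate settles near $\theta = 0.3$ regardless of whether $Y$ turned out to be 0 or 1, which is the behavior criticized in the discussion that follows.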

Example (4) illustrates the main advantages of the measurement error framework: simplicity
and familiarity. Unfortunately, there are a number of disadvantages. First, measurement-error
aggregators estimate $\theta$ instead of the realized value of the random variable $Y$. For this
reason, these aggregators often do not satisfy even the minimal performance requirements.
For instance, a non-trivial weighted average of calibrated forecasts is necessarily both
uncalibrated and under-confident (Ranjan and Gneiting, 2010; Satopää and Ungar, 2015). Second,
the standard assumption of conditional independence of the observations forces a specific and
highly unrealistic structure on interpreted forecasts (Hong and Page, 2009). Measurement-error
aggregators also cannot leave the convex hull of the individual forecasts, which further
contradicts the interpreted signal framework (Parunak et al., 2013) and can be easily seen to
result in poor empirical performance on many datasets. Third, the underlying model is rather
implausible. Relying on a true forecast $\theta$ invites philosophical debate, and even if one assumes
the existence of such a value, it is difficult to believe that the forecasters are actually seeing
it with independent noise. Therefore, whereas the interpreted signal framework proposes a
plausible micro-level explanation, the measurement error model does not; at best, it forces us
to imagine a group of forecasters who apply the same procedures to the same data but with
numerous small mistakes.
3. MODEL ESTIMATION


This section describes methodology for estimating the information structure $\Sigma$. Even though
$\Sigma$ is mostly used for aggregation, it also describes the information among the forecasters (see
the end of Section 2.2.1) and hence should be of interest to decision analysts, psychologists, and
the broader community studying collective problem solving. Unfortunately, estimating $\Sigma$ in
full generality based on a single prediction per forecaster is difficult. Therefore, to facilitate
model estimation, the forecasters are assumed to predict $K > 2$ related events. For instance,
in our second application 416 undergraduates guessed the weights of 20 people. This yielded
a $20 \times 416$ matrix that was then used to estimate $\Sigma$.


3.1 General Estimation Problem 

Denote the outcome of the $k$th event with $Y_k$ and the $j$th forecaster's prediction for this outcome
with $X_{jk}$. For the sake of generality, this section does not assume any particular link function
but instead operates directly with the corresponding information variables, denoted with $Z_{jk}$.
In practice, the forecasts $X_{jk}$ can often be transformed into $Z_{jk}$ at least approximately. This is
illustrated in Section 4. Recall that aggregation cannot access the outcomes $\{Y_1, \dots, Y_K\}$ or
their corresponding information variables $\{Z_{01}, \dots, Z_{0K}\}$. Instead, $\Sigma$ is estimated only based
on $\{Z_1, \dots, Z_K\}$, where the vector $Z_k = (Z_{1k}, \dots, Z_{Nk})'$ collects the forecasters' information
about the $k$th event.

This estimation must respect the covariance pattern (3). More specifically, if $\mathcal{S}_+^N$ denotes
the set of $N \times N$ symmetric positive semidefinite matrices and

$$
h(M) := \begin{pmatrix} 1 & \text{diag}(M)' \\ \text{diag}(M) & M \end{pmatrix}
$$

for some symmetric matrix $M$, then the final estimate must satisfy the condition $h(\Sigma) \in \mathcal{S}_+^{N+1}$.
Intuitively, this is satisfied if there exists a random variable $Y$ for which the forecasts $X_j$ are
jointly calibrated. In terms of information, this means that it is physically possible to allocate
information about $Y$ among the $N$ forecasters in the manner described by $\Sigma$. Therefore the
condition is named information coherence.
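
The coherence condition can be checked numerically. The following Python sketch is an illustration only; the helper names `h` and `is_information_coherent` and the tolerance are assumptions made for the example. It builds the bordered matrix from its definition and tests positive semidefiniteness through the smallest eigenvalue.

```python
import numpy as np

def h(M):
    """Border the symmetric matrix M with a leading 1 and its own diagonal."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    out = np.empty((n + 1, n + 1))
    out[0, 0] = 1.0
    out[0, 1:] = np.diag(M)
    out[1:, 0] = np.diag(M)
    out[1:, 1:] = M
    return out

def is_information_coherent(Sigma, tol=1e-10):
    """True if h(Sigma) is positive semidefinite up to a numerical tolerance."""
    return bool(np.linalg.eigvalsh(h(Sigma)).min() >= -tol)

# Three forecasters with no pairwise overlap (zero off-diagonal entries).
coherent = is_information_coherent(0.2 * np.eye(3))    # 3 * 0.2 <= 1
incoherent = is_information_coherent(0.9 * np.eye(3))  # 3 * 0.9 > 1
```

In the second case each forecaster would hold 90% of the total information without sharing any of it with the others, which cannot be allocated physically, so the check fails.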

Unfortunately, simply finding an accurate estimate of $\Sigma$ does not guarantee precise
aggregation. To see this, recall from Section 2.2.2 that $\mathbb{E}(Z_{0k} \mid Z_k) = \text{diag}(\Sigma)'\Sigma^{-1}Z_k$. This
term is generally found in the revealed aggregator and hence deserves careful treatment.
Re-express the term as $v'Z_k$, where $v$ is the solution to $\text{diag}(\Sigma) = \Sigma v$. The rate at which
the solution changes with respect to a change in $\text{diag}(\Sigma)$ depends on the condition number
$\text{cond}(\Sigma) := \lambda_{\max}(\Sigma)/\lambda_{\min}(\Sigma)$, i.e., the ratio between the maximum and minimum
eigenvalues of $\Sigma$. If the condition number is very large, a small error in $\text{diag}(\Sigma)$ can cause a
large error in $v$. If the condition number is small, $\Sigma$ is called well-conditioned and error in $v$
will not be much larger than the error in $\text{diag}(\Sigma)$. Thus, to prevent estimation error from being
amplified during aggregation, the estimation procedure should require $\text{cond}(\Sigma) \leq \kappa$ for a given
threshold $\kappa > 1$.
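
The effect of conditioning can be seen in a small numerical experiment. This Python sketch is illustrative only: the two $2 \times 2$ matrices and the perturbation are made-up values, and `aggregation_weights` is a hypothetical helper that solves $\text{diag}(\Sigma) = \Sigma v$.

```python
import numpy as np

def aggregation_weights(Sigma, delta):
    """Solve Sigma v = delta for the weight vector v."""
    return np.linalg.solve(Sigma, delta)

well = np.array([[0.5, 0.1], [0.1, 0.5]])      # cond = 0.6 / 0.4 = 1.5
ill = np.array([[0.5, 0.499], [0.499, 0.5]])   # cond = 0.999 / 0.001 = 999
error = np.array([1e-3, -1e-3])                # small error in diag(Sigma)

def weight_change(Sigma):
    """Size of the change in v caused by the error in the right-hand side."""
    delta = np.diag(Sigma)
    v_clean = aggregation_weights(Sigma, delta)
    v_noisy = aggregation_weights(Sigma, delta + error)
    return float(np.linalg.norm(v_noisy - v_clean))

change_well = weight_change(well)
change_ill = weight_change(ill)
```

The same perturbation moves the weights several hundred times further under the ill-conditioned matrix than under the well-conditioned one, which is why the estimation problem bounds the condition number.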

This all gives the following general estimation problem:

$$
\begin{aligned}
\text{minimize} \quad & f_0(\Sigma, \{Z_1, \dots, Z_K\}) \\
\text{subject to} \quad & h(\Sigma) \in \mathcal{S}_+^{N+1}, \text{ and} \\
& \text{cond}(\Sigma) \leq \kappa,
\end{aligned}
\tag{5}
$$

where $f_0$ is some objective function. The feasible region defined by the two constraints is
convex. Therefore, if $f_0$ is convex in $\Sigma$, expression (5) is a convex optimization problem.
Typically the global optimum to such a problem can be found very efficiently. Problem (5),
however, involves $\binom{N+1}{2}$ variables. Therefore it can be solved efficiently with standard
optimization techniques, such as the interior point methods, only as long as the number of
variables is not too large, say, not more than 1,000. Unfortunately, this means that the procedure
cannot be applied to prediction polls with more than about $N = 45$ forecasters, for which
$\binom{46}{2} = 1{,}035$. This is very limiting as many prediction polls involve hundreds of
forecasters. For instance, our two real-world applications involve 100 and 416 forecasters.
Fortunately, by choosing the loss function carefully one can perform dimension reduction and
estimate $\Sigma$ under a much larger $N$. This is illustrated in the following subsections.


3.2 Maximum Likelihood Estimator 


Under the Gaussian model the information structure $\Sigma$ is a parameter of an explicit
likelihood. Therefore estimation naturally begins with the maximum likelihood approach (MLE).
Unfortunately, the Gaussian likelihood is not convex in $\Sigma$. Consequently, only a locally
optimal solution is guaranteed with standard optimization techniques. Furthermore, it is not clear
whether the dimension of this form can be reduced. Won and Kim (2006) discuss the MLE
under a condition number constraint. They are able to transform the original problem with
$\binom{N+1}{2}$ variables to an equivalent problem with only $N$ variables, namely the eigenvalues of
$\Sigma$. This transformation, however, requires an orthogonally invariant problem. Given that the
constraint $h(\Sigma) \in \mathcal{S}_+^{N+1}$ is not orthogonally invariant, the same dimension-reduction technique
cannot be applied. Instead, the MLE must be computed with the $\binom{N+1}{2}$ variables, making
estimation slow for small $N$ and undoable even for moderately large $N$. For these reasons the MLE
is not discussed further in this paper.


3.3 Least Squares Estimator 

Past literature has discussed many simple covariance estimators that can be applied efficiently
to large amounts of data. Unfortunately, these estimators are not guaranteed to satisfy the
conditions in (5). This section introduces a correctional procedure that inputs any covariance
estimator $S$ and modifies it minimally such that the end result satisfies the conditions in (5).
More specifically, $S$ is projected onto the feasible region. This approach, sometimes known as
the least squares approach (LSE), motivates a convex loss function that guarantees a globally
optimal solution and facilitates dimension reduction. Most importantly, however, it provides
a general tool for estimating $\Sigma$, regardless of whether one is working with a Gaussian model or
possibly some future non-Gaussian model.

From the computational perspective, it is more convenient to project $h(S)$ instead of $S$.
Even though this could be done under many different norms, for the sake of simplicity, this
paper only considers the squared Frobenius norm $\|M\|_F^2 = \text{tr}(M'M)$, where $\text{tr}(\cdot)$ is the trace
operator. The LSE is then given by $h^{-1}(\Omega)$, i.e., $\Omega$ without the first row and column, where
$\Omega$ is the solution to


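$$
\begin{aligned}
\text{minimize} \quad & \|\Omega - h(S)\|_F^2 \\
\text{subject to} \quad & \Omega \in \mathcal{S}_+^{N+1}, \\
& \text{cond}(\Omega) \leq \kappa, \text{ and} \\
& \text{tr}(A_j \Omega) = b_j, \quad (j = 1, \dots, N + 1).
\end{aligned}
\tag{6}
$$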

Both $A_j$ and $b_j$ are constants defined to maintain the covariance pattern (3). More specifically,
if $e_j$ denotes the $j$th standard basis vector of length $N + 1$, then

$$
\begin{aligned}
b_1 &= 1, & A_1 &= e_1 e_1', \text{ and} \\
b_j &= 0, & A_j &= e_j e_j' - 0.5(e_1 e_j' + e_j e_1') \text{ for } j = 2, \dots, N + 1.
\end{aligned}
$$

If $\Omega$ satisfies the other two conditions, namely $\Omega \in \mathcal{S}_+^{N+1}$ and $\text{cond}(\Omega) \leq \kappa$, then
$\Sigma = h^{-1}(\Omega)$ also satisfies them. This follows from the fact that $\Sigma$ is a principal sub-matrix
of $\Omega$. Therefore $\Omega \in \mathcal{S}_+^{N+1}$ implies $\Sigma \in \mathcal{S}_+^N$. Furthermore, Cauchy's interlace theorem (see,
e.g., Hwang 2004) states that $\lambda_{\min}(\Omega) \leq \lambda_{\min}(\Sigma)$ and $\lambda_{\max}(\Sigma) \leq \lambda_{\max}(\Omega)$ such that
$\text{cond}(\Sigma) \leq \text{cond}(\Omega) \leq \kappa$. Of course, requiring $\text{cond}(\Omega) \leq \kappa$ instead of $\text{cond}(\Sigma) \leq \kappa$
shrinks the region of feasible $\Sigma$s. At this point, however, the exact value of $\kappa$ is arbitrary and
merely serves to control $\text{cond}(\Sigma)$. Section 3.4 introduces a procedure for choosing $\kappa$ from the
data. Under such an adaptive procedure, problem (6) can be considered equivalent to directly
projecting $S$ onto the feasible region.

The first step towards solving (6) is to express the feasible region as an intersection of the
following two sets:

$$
\begin{aligned}
C_{sd} &= \{\Omega : \Omega \in \mathcal{S}_+^{N+1}, \ \text{cond}(\Omega) \leq \kappa\}, \text{ and} \\
C_{lin} &= \{\Omega : \text{tr}(A_j \Omega) = b_j, \ j = 1, \dots, N + 1\}.
\end{aligned}
$$

Given that both of these sets are convex, projecting onto their intersection can be computed
with the Directional Alternating Projection Algorithm (Gubin et al., 1967). This method makes
progress by repeatedly projecting onto the sets $C_{sd}$ and $C_{lin}$. Consequently, it is efficient only if
projecting onto each of the individual sets is fast. Fortunately, as will be shown next, this turns
out to be the case.

First, projecting an $(N + 1) \times (N + 1)$ symmetric matrix $M = \{m_{ij}\}$ onto $C_{lin}$ is an affine
map. To make this more specific, let $m = \text{vec}(M)$ be a column-wise vectorization of $M$. If $A$
is a matrix with the $j$th row equal to $\text{vec}(A_j)'$, the linear constraints in (6) can be expressed as
$Am = e_1$. Then, the projection of $M$ onto $C_{lin}$ is given by $\text{vec}^{-1}(m + A'(AA')^{-1}(e_1 - Am))$.
This expression simplifies significantly by close inspection. In fact, it is equivalent to setting
$m_{11} = 1$ and for $j \geq 2$ replacing $m_{jj}$, $m_{1j}$, and $m_{j1}$ by their average $(m_{jj} + m_{1j} + m_{j1})/3$.
Denote this projection with the operator $\mathcal{P}_{lin}(\cdot)$.
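
The equivalence between the general formula and the averaging shortcut can be verified numerically. The Python sketch below is an illustration under the definitions of $A_j$ and $b_j$ given earlier; the function names are hypothetical and indices are zero-based, so index 0 plays the role of the first row and column.

```python
import numpy as np

def project_lin_closed_form(M):
    """Averaging shortcut: set m_11 = 1, then average m_jj, m_1j, m_j1."""
    M = np.asarray(M, dtype=float)
    P = M.copy()
    P[0, 0] = 1.0
    for j in range(1, M.shape[0]):
        avg = (M[j, j] + M[0, j] + M[j, 0]) / 3.0
        P[j, j] = P[0, j] = P[j, 0] = avg
    return P

def project_lin_general(M):
    """General projection m + A'(AA')^{-1}(e_1 - Am) on the vectorized matrix."""
    M = np.asarray(M, dtype=float)
    n = M.shape[0]
    A = np.zeros((n, n * n))
    for j in range(n):
        Aj = np.zeros((n, n))
        if j == 0:
            Aj[0, 0] = 1.0                 # A_1 = e_1 e_1'
        else:
            Aj[j, j] = 1.0                 # A_j = e_j e_j' - 0.5(e_1 e_j' + e_j e_1')
            Aj[0, j] = Aj[j, 0] = -0.5
        A[j] = Aj.flatten(order="F")       # column-wise vectorization
    m = M.flatten(order="F")
    b = np.zeros(n)
    b[0] = 1.0                             # right-hand side e_1
    m_proj = m + A.T @ np.linalg.solve(A @ A.T, b - A @ m)
    return m_proj.reshape((n, n), order="F")
```

For any symmetric input the two functions should agree up to rounding error, while the shortcut avoids forming the $(N+1) \times (N+1)^2$ constraint matrix.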


Second, Tanaka and Nakata (2014) describe a univariate optimization problem that is almost
equivalent to projecting $M$ onto $C_{sd}$. The only difference is that their solution set also
includes the zero-matrix $0$. Assuming that such a limiting case can be safely handled in the
implementation, their approach offers a fast projection onto $C_{sd}$ even for a moderately large $N$.
To describe this approach, consider the spectral decomposition $M = Q\,\text{Diag}(l_1, \dots, l_{N+1})\,Q'$
and the univariate function

$$
\pi(\mu) = \sum_{i=1}^{N+1} \left[ (\mu - l_i)_+^2 + (l_i - \kappa\mu)_+^2 \right],
$$

where $\text{Diag}(x)$ is a diagonal matrix with diagonal $x$ and $(\cdot)_+$ is the positive part operator. The
function $\pi(\mu)$ can be minimized very efficiently by solving a series of smaller convex problems,
each with a closed form solution. The result is a binary-search-like procedure described by
the algorithm in Appendix A. If $\mu^* = \arg\min_{\mu \geq 0} \pi(\mu)$ and

$$
\lambda_j^* = \begin{cases} \mu^* & \text{if } l_j \leq \mu^* \\ \kappa\mu^* & \text{if } \kappa\mu^* \leq l_j \\ l_j & \text{otherwise} \end{cases}
$$

for $j = 1, \dots, N + 1$, then $Q\,\text{Diag}(\lambda_1^*, \dots, \lambda_{N+1}^*)\,Q'$ is the projection of $M$ onto $C_{sd}$. Call this
projection $\mathcal{P}_{sd}(\cdot : \kappa)$.
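
A minimal Python sketch of this projection is given below for illustration. It does not reproduce the binary-search-like procedure of Appendix A: as a stand-in, and as an explicit assumption of this example, the convex function $\pi(\mu)$ is minimized by a plain ternary search over $[0, \max_i l_i]$. The names `pi_objective` and `project_sd` are hypothetical.

```python
import numpy as np

def pi_objective(mu, l, kappa):
    """pi(mu): squared cost of clipping the eigenvalues l to [mu, kappa*mu]."""
    below = np.maximum(mu - l, 0.0)
    above = np.maximum(l - kappa * mu, 0.0)
    return float(np.sum(below ** 2 + above ** 2))

def project_sd(M, kappa, n_iter=200):
    """Project a symmetric M onto {PSD, cond <= kappa} in Frobenius norm."""
    l, Q = np.linalg.eigh(M)                 # M = Q Diag(l) Q'
    lo, hi = 0.0, max(float(l.max()), 0.0)   # pi is increasing beyond max(l)
    for _ in range(n_iter):                  # ternary search (stand-in solver)
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if pi_objective(m1, l, kappa) < pi_objective(m2, l, kappa):
            hi = m2
        else:
            lo = m1
    mu = 0.5 * (lo + hi)
    lam = np.clip(l, mu, kappa * mu)         # the three-case rule for lambda_j*
    return (Q * lam) @ Q.T                   # Q Diag(lam) Q'
```

If all eigenvalues of the input are non-positive, the search returns $\mu^* = 0$ and the function returns the zero matrix, which is the limiting case mentioned above and would need separate handling in a real implementation.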

Algorithm 1 uses these projections to solve (6). Each iteration projects twice on one set
and once on the other set. The general form of the algorithm does not specify which projection
should be called twice. Therefore, given that $\mathcal{P}_{sd}(\cdot : \kappa)$ takes longer to run than $\mathcal{P}_{lin}(\cdot)$, it is
beneficial to choose to call $\mathcal{P}_{lin}(\cdot)$ twice. The complexity of each iteration is determined largely
by the spectral decomposition, which is fairly fast for moderately large $N$. Overall time to
convergence, of course, depends on the choice of the stopping criterion. Many intuitive criteria
are possible. Given that $\Omega_D \in C_{lin}$ and $\Omega_C \in C_{sd}$, the stopping criterion
$\max_{ij}\{(\Omega_D - \Omega_C)_{ij}^2\} < \epsilon$ suggests that the return value is in $C_{sd}$ and close to $C_{lin}$ in every
direction. Based on our experience, the algorithm converges quite quickly. For instance, our
implementation in C++ generally solves (6) for $\epsilon = 10^{-5}$ and $N = 100$ in less than a second on a
1.7 GHz Intel Core i5 computer. This code will be made available online upon publication. For
the remainder of the paper, projecting $S$ onto the feasible region is denoted with the operator
$\mathcal{P}_{LSE}(S : \kappa)$.

Require: Unconstrained covariance matrix estimator $S$, stopping criterion $\epsilon > 0$, and an upper
bound on the condition number $\kappa > 1$.
 1: procedure Directional Alternating Projection Algorithm
 2:     $\Omega_A \leftarrow h(S)$
 3:     repeat
 4:         $\Omega_B \leftarrow \mathcal{P}_{lin}(\Omega_A)$
 5:         $\Omega_C \leftarrow \mathcal{P}_{sd}(\Omega_B : \kappa)$
 6:         $\Omega_D \leftarrow \mathcal{P}_{lin}(\Omega_C)$
 7:         $\lambda \leftarrow \|\Omega_B - \Omega_C\|_F^2 / \text{tr}[(\Omega_B - \Omega_D)'(\Omega_B - \Omega_C)]$
 8:         $\Omega_A \leftarrow \Omega_B + \lambda(\Omega_D - \Omega_B)$
 9:     until $\max_{ij} (\Omega_D - \Omega_C)_{ij}^2 < \epsilon$
10:     return $h^{-1}(\Omega_C)$
11: end procedure

Algorithm 1: This procedure projects $h(S)$ onto the intersection $C_{sd} \cap C_{lin}$. Denote the
projection with $\mathcal{P}_{LSE}(S : \kappa)$. Throughout the paper, the stopping criterion is fixed at $\epsilon = 10^{-5}$.
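
For readers who want to experiment, the following is a compact Python sketch of Algorithm 1. It is not the C++ implementation mentioned above. The function names, the ternary-search solver inside `proj_sd`, the iteration cap `max_iter`, and the fallback $\lambda = 1$ when the denominator vanishes are all assumptions made for this illustration.

```python
import numpy as np

def h(S):
    """Border S with a leading 1 and its own diagonal."""
    n = S.shape[0]
    out = np.empty((n + 1, n + 1))
    out[0, 0] = 1.0
    out[0, 1:] = np.diag(S)
    out[1:, 0] = np.diag(S)
    out[1:, 1:] = S
    return out

def proj_lin(M):
    """P_lin: set m_11 = 1 and average m_jj, m_1j, m_j1 for j >= 2."""
    P = M.copy()
    P[0, 0] = 1.0
    idx = np.arange(1, M.shape[0])
    avg = (M[idx, idx] + M[0, idx] + M[idx, 0]) / 3.0
    P[idx, idx] = avg
    P[0, idx] = avg
    P[idx, 0] = avg
    return P

def proj_sd(M, kappa, n_iter=100):
    """P_sd: clip eigenvalues to [mu, kappa*mu]; mu found by ternary search."""
    l, Q = np.linalg.eigh(M)

    def cost(mu):
        return np.sum(np.maximum(mu - l, 0.0) ** 2
                      + np.maximum(l - kappa * mu, 0.0) ** 2)

    lo, hi = 0.0, max(float(l.max()), 0.0)
    for _ in range(n_iter):
        m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    mu = 0.5 * (lo + hi)
    return (Q * np.clip(l, mu, kappa * mu)) @ Q.T

def project_lse(S, kappa, eps=1e-5, max_iter=1000):
    """Directional alternating projection of h(S) onto C_sd and C_lin."""
    omega_a = h(np.asarray(S, dtype=float))
    for _ in range(max_iter):
        omega_b = proj_lin(omega_a)
        omega_c = proj_sd(omega_b, kappa)
        omega_d = proj_lin(omega_c)
        if np.max((omega_d - omega_c) ** 2) < eps:   # stopping criterion
            break
        denom = np.sum((omega_b - omega_d) * (omega_b - omega_c))
        lam = np.sum((omega_b - omega_c) ** 2) / denom if denom > 0 else 1.0
        omega_a = omega_b + lam * (omega_d - omega_b)
    return omega_c[1:, 1:]                           # h^{-1}(Omega_C)
```

A call such as `project_lse(S, kappa=100.0)` returns an $N \times N$ matrix whose condition number is at most $\kappa$ by the interlacing argument given earlier. If $h(S)$ already lies in both sets, the input is returned essentially unchanged after one pass.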

3.4 Selecting $\kappa$

The estimation procedure described in the previous section has one tuning parameter, namely
the condition number threshold $\kappa$. This subsection discusses an in-sample approach, called
conditional validation, that can be used for choosing any tuning parameter, such as $\kappa$, under
the partial information framework. To motivate, recall that the revealed aggregator $X''$ uses $\Sigma$
to regress $Z_0$ on the rest of the $Z_j$s. Of course, the accuracy of this prediction cannot be known
until the actual outcome is observed. However, apart from being unobserved, the variable
$Z_0$ is theoretically no different to the other $Z_j$s. This suggests the following algorithm: for
some value $\nu$ compute $\mathcal{P}_{LSE}(S : \nu)$, let each of the $Z_j$s in turn play the role of $Z_0$, predict its
value based on $Z_i$ for $i \neq j$, and choose the value of $\nu$ that yields the best overall accuracy.
Even though many accuracy measures could be chosen, this paper uses the conditional
log-likelihood. Therefore, if $Z_j^* = (Z_{j1}, \dots, Z_{jK})'$ collects the $j$th forecaster's information about
the $K$ events, the chosen value of $\kappa$ is

$$
\kappa_{cov} = \arg\max_{\nu} \sum_{j=1}^{N} \ell\left(Z_j^*, \mathcal{P}_{LSE}(S : \nu) \mid Z_i^* \text{ for } i \neq j\right), \tag{7}
$$

where the log-likelihood is now conditional on $Z_i^*$s for $i \neq j$ and $S$ is computed based on all
the forecasts $Z_1^*, \dots, Z_N^*$. Plugging this into the projection algorithm gives the final estimate
$\hat{\Sigma}_{cov} := \mathcal{P}_{LSE}(S : \kappa_{cov})$.

Unfortunately, the optimization problem (7) is non-convex in $\nu$. However, as was
mentioned before, Algorithm 1 is fast for moderately sized $N$. Therefore $\kappa$ can be chosen
efficiently (possibly in parallel on multicore machines) over a grid of candidate values. Overall,
the idea in conditional validation is similar to cross-validation but, instead of predicting across
rows (observations), the prediction is performed across columns (variables). This not only
mimics the actual process of revealed aggregation but is also likely to be more appropriate for
prediction polling that typically involves a large number of forecasters (large $N$) predicting
relatively few events (small $K$). Furthermore, it has no tuning parameters and remains more
stable when $K$ is small; see Appendix B for an illustration of this result under synthetic data.
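
The grid search behind (7) can be sketched as follows. This Python example is a rough illustration under stated assumptions, not the procedure used in the paper: each row of `Z` (one event) is treated as an independent draw from a zero-mean Gaussian with covariance given by the candidate estimate, and each column is scored by its ordinary Gaussian conditional density given the other columns. The names `conditional_loglik` and `select_kappa`, and the `project` argument standing in for $\mathcal{P}_{LSE}$, are hypothetical.

```python
import numpy as np

def conditional_loglik(Z, Sigma):
    """Sum over forecasters j of log p(Z[:, j] | all other columns).

    Z is a K x N array whose rows are assumed i.i.d. N(0, Sigma). With
    P = Sigma^{-1}, the conditional variance of column j is 1 / P_jj and the
    conditional residual for event k is (P z_k)_j / P_jj.
    """
    Z = np.asarray(Z, dtype=float)
    K = Z.shape[0]
    P = np.linalg.inv(Sigma)
    p = np.diag(P)                      # conditional precisions
    resid = (Z @ P) / p                 # K x N matrix of residuals
    return float(0.5 * K * np.sum(np.log(p / (2.0 * np.pi)))
                 - 0.5 * np.sum(p * resid ** 2))

def select_kappa(Z, S, grid, project):
    """Return the grid value whose projected estimate scores highest."""
    scores = [conditional_loglik(Z, project(S, kappa)) for kappa in grid]
    return grid[int(np.argmax(scores))]
```

Here `project(S, kappa)` can be any function that returns a feasible estimate for a given threshold. Because the candidates are scored independently, the list comprehension could be replaced by a parallel map when the grid is large.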

4. APPLICATIONS 

This section applies the partial information framework to different types of real world forecasts.
For each type there may be different ways to adopt the Gaussian model. The main point,
however, is not to find the optimal way to do this but rather to give illustrative examples on using
the framework and also to show how the resulting partial information aggregators outperform
the commonly used measurement error aggregators.




4.1 Probability Forecasts of Binary Outcomes 

4.1.1 Dataset 


During the second year of the Good Judgment Project (GJP) the forecasters made probability
estimates for 78 events, each with two possible outcomes. One of these events was illustrated
in Figure 1. Each prediction problem had a timeframe, defined as the number of days between
the first day of forecasting and the anticipated resolution day. These timeframes varied widely
among problems, ranging from 12 days to 519 days with a mean of 185.4 days. During each
timeframe the forecasters were allowed to update their predictions as frequently as they liked.
The forecasters knew that their estimates would be assessed for accuracy using the quadratic
loss (often known as the Brier score; see Brier 1950 for more details). This is a proper loss
function that incentivized the forecasters to report their true beliefs instead of attempting to game
the system. In addition to receiving $150 for meeting minimum participation requirements that
did not depend on prediction accuracy, the forecasters received status rewards for their
performance via leader-boards displaying the losses for the best 20 forecasters. Depending on the
details of the reward structure, such a competition for rank may eliminate the truth-revelation
property of proper loss functions (see, e.g., Lichtendahl Jr and Winkler, 2007).

This data collection raises several issues. First, given that the current paper does not focus
on modeling dynamic data, only forecasts made within some common time interval should be
considered. Second, not all forecasters made predictions for all the events. Furthermore, the
forecasters generally updated their forecasts infrequently, resulting in a very sparse dataset.
Such high sparsity can cause problems in computing the initial unconstrained estimator $S$.
Evaluating different techniques to handle missing values, however, is well outside the scope of
this paper. Therefore, to somewhat alleviate the effect of missing values, only the hundred most
active forecasters are considered. This makes sufficient overlap highly likely but, unfortunately,
still not guaranteed.
All these considerations lead to a parallel analysis of three scenarios: High Uncertainty
(HU), Medium Uncertainty (MU), and Low Uncertainty (LU). Important differences are
summarized in Table 1. Each scenario considers the forecasters' most recent prediction within a
different time interval. For instance, LU only includes each forecaster's most recent forecast
during 30-60 days before the anticipated resolution day. The resulting dataset has 60 events
of which 13 occurred. In the corresponding $60 \times 100$ table of forecasts, around 42% of the
values are missing. The other two scenarios are defined similarly.

Table 1: Summary of the three time intervals analyzed. Each scenario considers the forecasters'
most recent forecasts within the given time interval. The values in parentheses represent the
number of events that occurred. The final column shows the percent of missing forecasts.

Scenario                     Time Interval (days)    # of Events    Missing (%)
High Uncertainty (HU)        90-120                  49 (10)        51
Medium Uncertainty (MU)      60-90                   56 (14)        46
Low Uncertainty (LU)         30-60                   60 (13)        42


4.1.2 Model Specification and Aggregation 

The first step is to pick a link function and derive a Gaussian model for probability forecasts of
binary events. Overall, this construction resembles in many ways the latent variable version of
a standard probit model.

Model Instance. Identify the $k$th event with $Y_k \in \{0, 1\}$. These outcomes link to
the information variables via the following function:

$$
Y_k = g(Z_{0k}) = \begin{cases} 1 & \text{if } Z_{0k} > t_k \\ 0 & \text{otherwise,} \end{cases}
$$

where $t_k \in \mathbb{R}$ is some threshold value. Therefore the link function $g(\cdot)$ is simply
the indicator function $1_{A_k}$ of the event $A_k = \{Z_{0k} > t_k\}$. This threshold is defined
by the prior probability of the $k$th event $\mathbb{P}(Y_k = 1) = \Phi(-t_k)$, where $\Phi(\cdot)$ is the
CDF of a standard Gaussian distribution. Given that the thresholds are allowed to
vary among the events, each event has its own prior. The corresponding probability
forecasts $X_{jk} \in [0, 1]$ are

$$
X_{jk} = \mathbb{E}(Y_k \mid Z_{jk}) = \Phi\left(\frac{Z_{jk} - t_k}{\sqrt{1 - \delta_j}}\right).
$$

In a similar manner, the revealed aggregator $X_k'' \in [0, 1]$ for event $k$ is

$$
X_k'' = \mathbb{E}(Y_k \mid Z_k) = \Phi\left(\frac{\text{diag}(\Sigma)'\Sigma^{-1}Z_k - t_k}{\sqrt{1 - \text{diag}(\Sigma)'\Sigma^{-1}\text{diag}(\Sigma)}}\right). \tag{8}
$$


All the parameters of this model can be estimated from the data. The first step is to specify a
version of the unconstrained estimate $S$. If the $t_k$s do not change much, a reasonable and
simple estimate is obtained by transforming the sample covariance matrix $S_P$ of the probit scores
$P_{jk} := \Phi^{-1}(X_{jk})$. More specifically, if $D := \text{Diag}(d)\text{Diag}(1 + d)^{-1}$, where $d = \text{diag}(S_P)$,
then an unconstrained estimator of $\Sigma$ is given by $S = (I_N - D)^{1/2} S_P (I_N - D)^{1/2}$. Recall
that the GJP data holds many missing values. This is handled by estimating each pairwise
covariance in $S_P$ based on all the events for which both forecasters made predictions. Next,
compute $\hat{\Sigma}_{cov}$, where $\kappa_{cov}$ is chosen over a grid of 100 candidate values between 10 and 1,000.
Finally, the threshold $t_k$ can be estimated by letting $P_k = (P_{1k}, \dots, P_{Nk})'$, observing that
$-\text{Diag}(1 - \text{diag}(\Sigma))^{1/2} P_k \sim \mathcal{N}_N(t_k 1_N, \Sigma)$, and computing the precision-weighted average:

$$
\hat{t}_k = -\frac{1_N' \hat{\Sigma}_{cov}^{-1} \text{Diag}(1 - \text{diag}(\hat{\Sigma}_{cov}))^{1/2} P_k}{1_N' \hat{\Sigma}_{cov}^{-1} 1_N}.
$$

If $P_k$ has missing values, the corresponding rows and columns of $\hat{\Sigma}_{cov}$ are dropped. Intuitively,
this estimator gives more weight to the forecasters with very little information. These estimates
are then plugged in to (8) to get the revealed aggregator $X_{cov}''$.
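
The estimation steps above can be strung together in a short Python sketch. It is meant only as an illustration: the function names are hypothetical, the threshold estimator is a generalized-least-squares reading of the "precision-weighted average" (an assumption of this example), and the handling of missing values and the projection step are omitted.

```python
import numpy as np
from statistics import NormalDist

_STD = NormalDist()  # standard Gaussian, used for the CDF Phi

def unconstrained_estimate(S_P):
    """S = (I - D)^{1/2} S_P (I - D)^{1/2} with D = Diag(d) Diag(1 + d)^{-1}."""
    d = np.diag(S_P)
    scale = np.sqrt(1.0 / (1.0 + d))       # diagonal of (I - D)^{1/2}
    return S_P * np.outer(scale, scale)

def estimate_threshold(P_k, Sigma):
    """Precision-weighted average of -Diag(1 - diag(Sigma))^{1/2} P_k."""
    w = np.linalg.solve(Sigma, np.ones(len(P_k)))   # Sigma^{-1} 1_N
    u = -np.sqrt(1.0 - np.diag(Sigma)) * P_k
    return float(w @ u / w.sum())

def revealed_aggregator(P_k, Sigma, t_k):
    """Evaluate (8) from the probit scores P_k of one event."""
    delta = np.diag(Sigma)
    Z_k = np.sqrt(1.0 - delta) * P_k + t_k          # invert the forecast map
    v = np.linalg.solve(Sigma, delta)               # Sigma^{-1} diag(Sigma)
    score = (v @ Z_k - t_k) / np.sqrt(1.0 - delta @ v)
    return _STD.cdf(float(score))
```

As a sanity check, with a single forecaster the aggregator simply returns that forecaster's own probability, since there is no other information to combine.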

This aggregator is benchmarked against the state-of-the-art measurement-error aggregators,
namely the average probability, median probability, average probit-score, and average
log-odds. Unequally weighted averages were not considered because it is unclear how the
weights would be determined based on forecasts alone, and even if this could be done somehow
(perhaps based on self-assessment or organizational status), using unequal weights often
leads to no or very small performance gains (Rowse et al., 1974; Ashton and Ashton, 1985;
Flores and White, 1989). To avoid infinite log-odds and probit scores, extreme forecasts $X_{jk} = 0$
and 1 were censored to $X_{jk} = 0.001$ and 0.999, respectively. The results remain insensitive to
the exact choice of censoring as long as this is done in a reasonable manner to keep the extreme
probabilities from becoming highly influential in the logit- or probit-space. The accuracy of
the aggregates is measured with the average root-mean-squared-error (RMSE). Note that this
is nothing but the square root of the commonly used Brier score. Instead of considering all the
forecasts at once, the aggregators are evaluated under different $N$ via repeated subsampling of
the 100 most active forecasters; that is, choose $N$ forecasters uniformly at random, aggregate
their forecasts, and compute the RMSE. This is repeated 1,000 times with $N = 5, 10, \dots, 65$
forecasters. Due to high computational cost, the simulation was stopped after $N = 65$. In the
rare occasion where no pairwise overlap is available between one or more pairs of the selected
forecasters, the subsampling is repeated until all pairs have at least one problem in common.
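
For concreteness, the four benchmark aggregators and the RMSE can be written in a few lines of standard-library Python. The function names are hypothetical and the sketch is illustrative only; censoring is applied exactly as described, that is, only to forecasts equal to 0 or 1.

```python
import math
from statistics import NormalDist, mean, median

_STD = NormalDist()  # standard Gaussian for probit scores

def censor(p, lo=0.001, hi=0.999):
    """Replace the extreme forecasts 0 and 1 by 0.001 and 0.999."""
    if p <= 0.0:
        return lo
    if p >= 1.0:
        return hi
    return p

def average_probability(ps):
    return mean(ps)

def median_probability(ps):
    return median(ps)

def average_probit(ps):
    """Average the probit scores, then map back to a probability."""
    return _STD.cdf(mean(_STD.inv_cdf(censor(p)) for p in ps))

def average_log_odds(ps):
    """Average the log-odds, then map back to a probability."""
    m = mean(math.log(censor(p) / (1.0 - censor(p))) for p in ps)
    return 1.0 / (1.0 + math.exp(-m))

def rmse(preds, outcomes):
    """Root-mean-squared error, i.e., the square root of the Brier score."""
    return math.sqrt(mean((p - y) ** 2 for p, y in zip(preds, outcomes)))
```

For example, `rmse([0.5, 0.5], [0, 1])` evaluates to 0.5, the score obtained by constantly predicting 0.5.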

Figure 3 shows the average RMSEs under the three scenarios described in Table 1. Here a
reasonable upper bound is given by 0.5 as this is the RMSE one would receive by constantly
predicting 0.5. All presented scores, however, are well below it and improve uniformly from
left to right, that is, from HU to LU. This reflects the decreasing level of uncertainty. In all
the figures the measurement-error aggregators rank in the typical order (from worst to best):
average probability, median probability, average probit, and average log-odds. Regardless of
the level of uncertainty, the revealed aggregator $X_{cov}''$ outperforms the averaging aggregators as
long as $N > 10$. The relative advantage, however, increases from HU to LU. More specifically,
the improvement from Log-odds to $X_{cov}''$ is about 2%, 17%, and 21% in HU, MU, and LU,
respectively. This trend can be explained by several reasons. First, as can be seen in Table 1, the
amount of data increases from HU to LU. This yields a better estimate of $\Sigma$ and hence more
accurate revealed aggregation. Second, the forecasters are more likely to be well-calibrated
under MU and LU than under HU (see, e.g., Braun and Yaniv 1992). Third, under HU the
events are still inherently very uncertain. Consequently, the forecasters are unlikely to hold
much useful information as a group. Under such low information diversity, measurement-error
aggregators generally perform relatively well (Satopää et al., 2015). On the contrary, under MU
the events have lost a part of their inherent uncertainty, allowing some forecasters to possess
useful private information. These individuals are then prioritized by $X_{cov}''$ while the averaging
aggregators continue treating all forecasts equally. Consequently, the performance of the
measurement error aggregators plateaus after $N = 30$ or so. Therefore having more than about 30
forecasters does not make a difference if one is determined to aggregate their predictions using
the measurement error techniques; a similar result was reported by Satopää et al. (2014a). In
contrast, however, the RMSE of $X_{cov}''$ continues to improve linearly in $N$, suggesting that $X_{cov}''$
is able to find some residual information in each additional forecaster and use this to increase
its performance advantage.

[Figure 3 panels: (a) High Uncertainty (HU), (b) Medium Uncertainty (MU), (c) Low Uncertainty
(LU). Legend: Average, Median, Log-odds, Probit, $X_{cov}''$.]

Figure 3: Average prediction accuracy over the 1,000 sub-samplings of the forecasters. See
Table 1 for descriptions of the different scenarios.


4.1.3 Information Diversity 

The GJP assigned the forecasters to make predictions either in isolation or in teams.
Furthermore, after the first year of the tournament, the top 2% forecasters were elected to the elite
group of "super-forecasters." These super-forecasters then worked in exclusive teams to make
highly accurate predictions on the same events as the rest of the forecasters. Overall, these
assignments directly suggest a level of information overlap. In particular, recalling the
interpretation of $\Sigma$ from Section 2.2.1, super-forecasters can be expected to have the highest $\delta_j$s
and forecasters in the same team should have a relatively high $\rho_{ij}$. This subsection examines
how well $\hat{\Sigma}_{cov}$ aligns with this prior knowledge about the forecasters' information structure.

For the sake of brevity, only the LU scenario is analyzed as this is where $X_{cov}''$ presented the
highest relative improvement. The associated 100 forecasters involve 36 individuals predicting
in isolation, 33 forecasting team-members (across 24 teams), and 31 super-forecasters (across
5 teams). Figure 4a displays $\hat{\Sigma}_{cov}$ for the five most active forecasters. This group involves two
forecasters working in isolation (Iso. A and B) and three super-forecasters (Sup. A, B, and
C), of whom the super-forecasters A and B are in the same team. Overall, $\hat{\Sigma}_{cov}$ agrees with
this classification: the only two team members, namely Sup. A and B, have a relatively high
information overlap. In addition, the three super-forecasters are more informed than the
non-super-forecasters. Such a high level of information unavoidably leads to higher information
overlap with the rest of the forecasters.

By and large, this agreement generalizes to the entire group of forecasters. To illustrate,
Figure 4b displays $\hat{\Sigma}_{cov}$ for all the 100 forecasters. The information structure has been ordered
with respect to the diagonal such that the more informed forecasters appear on the right.
Furthermore, a colored rug has been appended on the top. This rug shows whether each forecaster
worked in isolation, in a non-super-forecaster team, or in a super-forecaster team. Observe that
the super-forecasters are mostly situated on the right among the most informed forecasters.
The average estimated $\delta_j$ among the super-forecasters is 0.80. On the other hand, the average
estimated $\delta_j$ among the individuals working in isolation and in non-super-forecaster teams is
0.47 and 0.50, respectively. Therefore working in a team makes the forecasters' predictions,
on average, slightly more informed.

[Figure 4 panels: (a) $\hat{\Sigma}_{cov}$ for the five most active forecasters (Iso. A, Iso. B, Sup. A,
Sup. B, Sup. C); (b) $\hat{\Sigma}_{cov}$ for all 100 forecasters shows high information diversity. Rug
legend: Isolation, Team, Super.]

Figure 4: The estimated information structure $\hat{\Sigma}_{cov}$ under the LU scenario. Each forecaster
worked either in isolation, in a non-super-forecaster team, or in a super-forecaster team. The
super-forecasters generally have more information than the forecasters working in isolation.

In general, a plot such as Figure 4b is useful for assessing the level of information
diversity among the forecasters: the further away it is from a monochromatic plot, the higher the
information diversity. That being said, the colorful Figure 4b suggests that the GJP forecasters
have high information diversity. This makes sense as these forecasters were asked to make
predictions about international political events. Given that on such events the forecasters'
background knowledge, education, how closely they follow the news, and so on matter, one should
expect a high level of information diversity. Therefore not only does $X_{cov}''$ clearly outperform
the common measurement error aggregators in terms of prediction accuracy but the Gaussian
model also captures true structure in the data.


4.2 Point Forecasts of Continuous Outcomes 


4.2.1 Dataset 


Moore and Klein (2008) hired 415 undergraduates from Carnegie Mellon University to guess
the weights of 20 people based on a series of pictures. These forecasts were illustrated in
Figure 2. The target people were between 7 and 62 years old and had weights ranging from 61
to 230 pounds, with a mean of 157.6 pounds. All the students were shown the same pictures
and hence given the exact same information. Therefore any information diversity arises purely
from the participants’ decisions to use different subsets of the same information. Consequently,
information diversity is likely to be low compared to Section 4.1 where diversity also stemmed
from differences in the information available to the forecasters.

Unlike in Section 4.1, the Gaussian model can be applied almost directly to the data. Only
the effect of extreme values was reduced via a 90% Winsorization (Hastings et al., 1947). This
handled some obvious outliers. For instance, the original dataset contained a few estimates
above 1000 pounds and as low as 10 pounds. Winsorization generally improved the performance
of all the competing aggregators.
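
The Winsorization step can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "90% Winsorization" means clipping each target's forecasts at the 5th and 95th percentiles, and the function name `winsorize_90` and the example guesses are hypothetical.

```python
import numpy as np

def winsorize_90(x):
    """Clip values below the 5th and above the 95th percentile.

    Assumption: a 90% Winsorization keeps the central 90% of the
    forecasts and pulls both tails in to the percentile bounds.
    """
    x = np.asarray(x, dtype=float)
    lo, hi = np.percentile(x, [5, 95])
    return np.clip(x, lo, hi)

# Hypothetical guesses (in pounds) for one target person, with two obvious outliers.
guesses = np.array([10, 140, 150, 152, 155, 158, 160, 162, 165, 170, 1000])
clean = winsorize_90(guesses)
```

The extreme guesses are pulled toward the bulk of the data instead of being discarded, so every participant still contributes a forecast to the aggregate.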
4.2.2 Model Specification and Aggregation

Model Instance. Suppose $Y_k$ and $X_{jk}$ are real-valued. If the proper non-informative
prior distribution of $Y_k$ is $\mathcal{N}(\mu_{0k}, \sigma_0^2)$, then
$Y_k = g(Z_{0k}) = Z_{0k}\sigma_0 + \mu_{0k}$. Consequently,
$X_{jk} = E(Y_k \mid Z_{jk}) = Z_{jk}\sigma_0 + \mu_{0k}$ for all $j = 1, \dots, N$. Therefore
$X_{jk} \sim \mathcal{N}(\mu_{0k}, \sigma_j^2)$ for some $\sigma_j^2 \le \sigma_0^2$. If
$Z_k = (Z_{1k}, \dots, Z_{Nk})'$, then the revealed aggregator for the $k$th event is

$$X''_k = E(Y_k \mid Z_k) = \mathrm{diag}(\Sigma)' \Sigma^{-1} Z_k \sigma_0 + \mu_{0k}. \qquad (9)$$
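
Equation (9) can be evaluated numerically as in the minimal sketch below. It assumes that the information structure, the prior mean, and the prior standard deviation are already known; the function name `revealed_aggregate` and all numbers are hypothetical.

```python
import numpy as np

def revealed_aggregate(Sigma, z, mu0, sigma0):
    """Evaluate Equation (9): diag(Sigma)' Sigma^{-1} z * sigma0 + mu0."""
    Sigma = np.asarray(Sigma, dtype=float)
    z = np.asarray(z, dtype=float)
    # Sigma is symmetric, so Sigma^{-1} diag(Sigma) holds the same weights
    # as the row vector diag(Sigma)' Sigma^{-1}.
    weights = np.linalg.solve(Sigma, np.diag(Sigma))
    return float(weights @ z) * sigma0 + mu0

# Hypothetical two-forecaster example on the weight-guessing scale (pounds).
mu0, sigma0 = 150.0, 30.0
x = np.array([165.0, 144.0])            # point forecasts X_jk
z = (x - mu0) / sigma0                  # invert X_jk = Z_jk * sigma0 + mu0
Sigma = np.array([[0.4, 0.1],
                  [0.1, 0.5]])
aggregate = revealed_aggregate(Sigma, z, mu0, sigma0)
```

Solving the linear system instead of forming the inverse explicitly is a numerical convenience; the result is the same as the matrix expression in Equation (9).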


Under this model the prior distribution of $Y_k$ is specified by $\mu_{0k}$ and $\sigma_0^2$.
Given that $E(X_{jk}) = \mu_{0k}$ for all $j = 1, \dots, N$, the sample average
$\hat{\mu}_{0k} = \sum_{j=1}^{N} X_{jk}/N$ provides an initial estimate of $\mu_{0k}$. The value
of $\sigma_0^2$ can be estimated by assuming a distribution for the $\sigma_j^2$s. More
specifically, let the $\sigma_j^2$ be i.i.d. on the interval $[0, \sigma_0^2]$ and use the
resulting likelihood to estimate $\sigma_0^2$. For instance, a non-informative choice is to
assume $\sigma_j^2 \overset{i.i.d.}{\sim} \mathcal{U}(0, \sigma_0^2)$, which leads to the maximum
likelihood estimator $\max_j\{\sigma_j^2\}$. This has a downward bias that can be corrected by a
multiplicative factor of $(N+1)/N$. Therefore, replacing $\sigma_j^2$ with the sample variance
$s_j^2 = \sum_{k=1}^{K} (X_{jk} - \hat{\mu}_{0k})^2/(K-1)$ gives the final estimate
$\hat{\sigma}_0^2 = \frac{N+1}{N} \max_j\{s_j^2\}$. Using these estimates, the $X_{jk}$s can be
transformed into the $Z_{jk}$s whose sample covariance matrix provides the unconstrained
estimator for the projection algorithm. The value of $\kappa_{cov}$ is chosen over a grid of 10
values between 10 and 10,000. Once $\hat{\Sigma}_{cov}$ has been computed, the prior means are
updated with the precision-weighted averages
$\hat{\mu}_{0k} = (X_k' \hat{\Sigma}_{cov}^{-1} 1_N)/(1_N' \hat{\Sigma}_{cov}^{-1} 1_N)$. In
the end, all these estimates are plugged into (9) to get the revealed aggregator $X''_{cov}$.
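
The estimation steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the forecasts are assumed to be stored in an N-by-K array, the function names are hypothetical, and the projection step that turns the sample covariance into $\hat{\Sigma}_{cov}$ is not reproduced here (an identity matrix stands in for it in the example).

```python
import numpy as np

def estimate_prior(X):
    """X has shape (N, K): forecaster j's point forecast for event k."""
    X = np.asarray(X, dtype=float)
    N, K = X.shape
    mu0 = X.mean(axis=0)                          # initial mu_hat_{0k}, one per event
    s2 = ((X - mu0) ** 2).sum(axis=1) / (K - 1)   # s_j^2, one per forecaster
    sigma0_sq = (N + 1) / N * s2.max()            # bias-corrected maximum
    Z = (X - mu0) / np.sqrt(sigma0_sq)            # invert X = Z * sigma0 + mu0
    S = np.cov(Z)                                 # N x N sample covariance (rows = forecasters)
    return mu0, sigma0_sq, Z, S

def update_prior_means(X, Sigma_hat):
    """Precision-weighted averages (X_k' Sigma^{-1} 1_N) / (1_N' Sigma^{-1} 1_N)."""
    X = np.asarray(X, dtype=float)
    ones = np.ones(X.shape[0])
    w = np.linalg.solve(Sigma_hat, ones)          # Sigma^{-1} 1_N
    return (X.T @ w) / (ones @ w)

# Synthetic forecasts: 6 forecasters, 12 events (hypothetical numbers).
rng = np.random.default_rng(0)
X = 150 + 20 * rng.standard_normal((6, 12))
mu0, sigma0_sq, Z, S = estimate_prior(X)
# Placeholder: the identity matrix stands in for the projected estimate.
mu0_updated = update_prior_means(X, np.eye(6))
```

With an identity information structure the precision-weighted average reduces to the plain sample average, which gives a quick sanity check on the update step.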

This aggregator is compared against the average, the median, and the average of the median and
average (AMA). The last competitor, AMA, is a heuristic aggregator that Lobo and Yao (2010)
showed to work particularly well on many different real-world forecasting datasets. In this
section the overall accuracy is measured with the RMSE averaged over 10,000 sub-samplings of
the 416 participants. That is, each iteration chooses $N$ participants uniformly at random,
aggregates their forecasts, and computes the RMSE. The size of the sub-samples is varied
between 10 and 100 with increments of 10. These scores are presented in Figure 5. The average
outperforms the median across all $N$. The performance of AMA falls between that of the average
and the median, reflecting its nature as a compromise between the two. The revealed aggregator
$X''_{cov}$ is the most accurate once $N > 10$. The relatively worse performance at $N = 10$
suggests that 10 observations are not enough to estimate $\mu_{0k}$ accurately. As $N$
approaches 100, however, $X''_{cov}$ collects information efficiently and increases its
performance advantage over the other aggregators.

Figure 5: Average prediction accuracy

Figure 6: $\hat{\Sigma}_{cov}$ for all 416 forecasters shows low information diversity.
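
The sub-sampling evaluation can be sketched as below for the three measurement-error competitors. This is a minimal illustration on synthetic data: the function names, the noise model, and the assumption that participants are drawn without replacement are not taken from the paper, and the iteration count is reduced to keep the example fast.

```python
import numpy as np

def ama(x, axis=0):
    """Average of the median and the average (AMA)."""
    return 0.5 * (np.mean(x, axis=axis) + np.median(x, axis=axis))

def subsample_rmse(X, y, aggregator, n, iters=1000, seed=0):
    """Mean RMSE of `aggregator` over random size-n subsets of the rows of X."""
    rng = np.random.default_rng(seed)
    scores = np.empty(iters)
    for i in range(iters):
        # Assumption: participants are sampled without replacement.
        idx = rng.choice(X.shape[0], size=n, replace=False)
        pred = aggregator(X[idx], axis=0)         # one aggregate per event
        scores[i] = np.sqrt(np.mean((pred - y) ** 2))
    return scores.mean()

# Synthetic stand-in for the weight-guessing data (not the real dataset).
rng = np.random.default_rng(1)
y = rng.uniform(61, 230, size=20)                 # hypothetical true weights
X = y + rng.normal(0, 25, size=(416, 20))         # hypothetical noisy guesses

results = {
    name: subsample_rmse(X, y, f, n=10, iters=200)
    for name, f in [("average", np.mean), ("median", np.median), ("AMA", ama)]
}
```

Repeating the call for n = 10, 20, ..., 100 would trace out accuracy curves of the kind summarized in Figure 5.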

Figure 6 shows $\hat{\Sigma}_{cov}$ for all the 416 forecasters. Similarly to before, the matrix
has been ordered such that the most knowledgeable forecasters are on the right. Overall, this
plot is much more monochromatic than the one presented earlier in Figure 4b, suggesting that
information diversity among the 416 students is rather low. This aligns with the expectations
laid out earlier in Section 4.2.1. If there were no information diversity, i.e., all the
forecasters used the same information, then averaging aggregators, such as the simple average,
would perform very well (Satopää et al., 2015). Such a limiting case, however, is rarely
encountered in practice. Often at least some information diversity is present. The results in
the current section show that the revealed aggregator does not require extremely high
information diversity in order to outperform the measurement-error aggregators.


5. DISCUSSION 


This paper introduced the partial information framework for modeling forecasts from different
types of prediction polls. Even though the framework can be used for theoretical analysis and
studying information among groups of experts, the main focus was on model-based aggregation
of forecasts. Such aggregators do not require a training set. Instead, they operate under
a model of forecast heterogeneity and hence can be applied to forecasts alone. Under the
partial information framework, all forecast heterogeneity stems from differences in the way the
forecasters use information. Intuitively, this is more plausible at the micro-level than the
historical measurement error. To facilitate practical applications, the partial information
framework motivates and describes the forecasters’ information with a patterned covariance
matrix (Equation 1). A correctional procedure was proposed (Algorithm 1) as a general tool for
estimating these information structures. This procedure inputs any covariance estimator and
modifies it minimally such that the final output represents a physically feasible allocation of
information. Even though the general partial information framework describes an optimal
aggregator, it is generally too abstract to be directly applied in practice. As a solution,
this paper discusses a close yet practical specification within the framework, known as the
Gaussian model (Section 2.2.2). The Gaussian model permits a closed-form solution for the
optimal aggregator and extends to different types of forecast-outcome pairs via a link
function. These partial information aggregators were evaluated against the common measurement
error aggregators on two different real-world prediction polls (Section 4). In each case the
Gaussian model outperformed the typical measurement-error-based aggregators, suggesting that
information diversity is more important for modeling forecast heterogeneity.

Generally speaking, partial information aggregation works well because it downweights
pairs or sets of forecasters that share more information and upweights ones that have unique
information (or choose to attend to unique information as is the case, e.g., in Section 4.2
where forecasters made judgments based on the same pictures). This is very different from
measurement-error aggregators that assume all forecasters to have the same information and
hence consider them equally important. While simple measurement-error techniques, such as
the average or median, can work well when the forecasters truly operate on the same
information set, in real-world prediction polls participants are more likely to have unequal
skill and information sets. Therefore prioritizing is almost certainly called for. Of course,
the more diverse these sets are, the better the partial information aggregators can be expected
to perform relative to the measurement error aggregators. To illustrate this result, compare
the relative performances in Section 4.1 (high information diversity) against those in
Section 4.2 (low information diversity).
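
As a minimal worked illustration of this downweighting, consider just two forecasters under the Gaussian model, each with the same amount of information $\delta$ (the diagonal of $\Sigma$) and with pairwise overlap $\rho$ (the off-diagonal). The two-forecaster setup and the equal $\delta$ are simplifying assumptions made here for illustration. The weights $\mathrm{diag}(\Sigma)'\Sigma^{-1}$ from Equation (9) then reduce to:

```latex
\Sigma = \begin{pmatrix} \delta & \rho \\ \rho & \delta \end{pmatrix},
\qquad
\Sigma^{-1} = \frac{1}{\delta^2 - \rho^2}
\begin{pmatrix} \delta & -\rho \\ -\rho & \delta \end{pmatrix},
\qquad
\mathrm{diag}(\Sigma)'\Sigma^{-1}
= \frac{\delta(\delta - \rho)}{\delta^2 - \rho^2}\,(1,\ 1)
= \frac{\delta}{\delta + \rho}\,(1,\ 1).
```

Each weight shrinks as the overlap $\rho$ grows. As $\rho$ approaches $\delta$ (nearly identical information), each weight approaches $1/2$, which is the simple average. If $\rho = 0$ (no shared information), each weight equals 1, so the aggregate adds the two standardized forecasts rather than averaging them.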

Overall, the partial information framework can be applied and extended in many different
ways. For instance, in this paper the $j$th forecaster’s prediction was assumed to be the
expectation of $Y$ after observing some partial information $\mathcal{F}_j$. In some
applications, however, other constructs, such as the conditional median or other quantiles, may
be more appropriate. Such extensions can be handled by considering the distribution of
$Y \mid \mathcal{F}_j$ and then equating the $j$th forecaster’s prediction to any desired
functional of this distribution. This is particularly easy under the Gaussian model, where
$Y \mid \mathcal{F}_j$ conveniently follows a Gaussian distribution.
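
To make the quantile extension concrete, the sketch below works it out for the point-forecast instance of Section 4.2.2, where $Y = Z_0\sigma_0 + \mu_0$. It assumes the Gaussian model's covariance pattern is $\mathrm{Var}(Z_0) = 1$ and $\mathrm{Var}(Z_j) = \mathrm{Cov}(Z_0, Z_j) = \delta_j$, which is consistent with Equation (9) but is stated here as an assumption. $Q_q$ denotes the $q$th conditional quantile and $\Phi$ the standard normal distribution function.

```latex
Z_0 \mid Z_j \sim \mathcal{N}\bigl(Z_j,\ 1 - \delta_j\bigr)
\quad\Longrightarrow\quad
Y \mid Z_j \sim \mathcal{N}\bigl(X_j,\ \sigma_0^2 (1 - \delta_j)\bigr),
\qquad X_j = Z_j \sigma_0 + \mu_0,
\\[4pt]
Q_q(Y \mid Z_j) = X_j + \sigma_0 \sqrt{1 - \delta_j}\;\Phi^{-1}(q),
\qquad 0 < q < 1.
```

For $q = 1/2$ the correction term vanishes, so under these assumptions the conditional median coincides with the mean forecast $X_j$; other quantiles shift the forecast by an amount that shrinks as the forecaster becomes more informed.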

In terms of future research, the partial information framework offers both theoretical and
empirical directions. One theoretical avenue involves estimation of information overlap. In
some cases the higher order overlaps have been found to be irrelevant to aggregation. For
instance, DeGroot and Mortera (1991) show that the pairwise conditional (on the truth)
distributions of the forecasts are sufficient for computing the optimal weights of a weighted
average. Theoretical results on the significance or insignificance of higher order overlaps
under the partial information framework would be desirable. Given that the Gaussian model can
only accommodate pairwise information overlap, such a result would reveal the need for a
specification that is more complex than the Gaussian model.

A promising empirical direction is the Bayesian approach. These techniques are very natural
for fitting hierarchical models such as the ones discussed in this paper. Furthermore, in
many applications with small or moderately sized datasets, Bayesian methods have been found
to be more stable than the likelihood-based alternatives. Therefore, given that the number of
forecasts in a prediction poll is typically quite small, a Bayesian approach is likely to improve
the quality of the final aggregate. This would involve developing a prior distribution for the
information structure - a problem that seems interesting in itself. Overall, this avenue should
certainly be pursued, and the results tested against other high performing aggregators.